pith. machine review for the scientific record.

arxiv: 2406.04264 · v3 · submitted 2024-06-06 · 💻 cs.CV · cs.AI · cs.CL

MLVU: Benchmarking Multi-task Long Video Understanding

Pith reviewed 2026-05-14 19:49 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.CL
keywords long video understanding, multimodal large language models, benchmark evaluation, video analysis, context length, model performance, diverse video genres

The pith

MLVU benchmark shows current multimodal models struggle with most long video tasks and degrade sharply on longer clips.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces MLVU as a benchmark to evaluate long video understanding in multimodal large language models. It extends video durations across a wide range, incorporates many genres such as movies and surveillance footage, and defines multiple tasks that test core abilities like summarization and question answering. Testing 23 recent models reveals that all of them fail on most tasks, with clear drops in accuracy as video length increases. A sympathetic reader would care because real applications often require processing extended video where today's techniques appear limited, pointing to specific factors like context handling that need work.

Core claim

MLVU addresses prior benchmark limits by allowing flexible video lengths, covering diverse genres, and offering varied evaluation tasks; the study of 23 MLLMs shows every existing method struggles on most tasks and suffers severe performance degradation on longer videos, while also indicating that context length, image-understanding ability, and LLM backbone choice matter for progress.

What carries the argument

The MLVU benchmark itself, which supplies extended videos, genre variety, and diversified tasks to measure MLLMs' long-video understanding abilities.

If this is right

  • Improvements in context length handling would directly raise scores on longer videos.
  • Stronger image-understanding components and better LLM backbones would lift overall long-video performance.
  • Models must be tested across multiple genres to confirm they generalize beyond narrow cases.
  • Future work can use MLVU scores to track whether new techniques close the observed gaps.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Success on MLVU could transfer to practical uses such as analyzing surveillance or summarizing extended recordings.
  • The benchmark may highlight whether scaling context windows alone solves the length degradation or if new architectures are required.
  • Adding even longer videos or additional task types in follow-ups would likely expose further limits in current designs.

Load-bearing premise

The chosen video lengths, genres, and tasks sufficiently represent the main challenges of real-world long video understanding.

What would settle it

A new model that maintains high accuracy on all MLVU tasks without measurable drop when video length increases would directly challenge the reported performance degradation.
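One way to make this test concrete is to bucket per-question results by video duration and compare accuracy across buckets. The sketch below is illustrative only: the record format, the bucket edges, and the function names `accuracy_by_duration` and `max_drop` are assumptions made for this example, not part of any MLVU tooling.

```python
from collections import defaultdict


def accuracy_by_duration(records, edges_sec=(180, 600, 1800)):
    """Group (duration_sec, is_correct) records into duration buckets.

    edges_sec are hypothetical bucket boundaries in seconds. A record with
    duration d lands in bucket i, where i is the number of edges <= d.
    Returns {bucket_index: accuracy} for the non-empty buckets.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for duration_sec, is_correct in records:
        bucket = sum(duration_sec >= edge for edge in edges_sec)
        totals[bucket] += 1
        hits[bucket] += int(is_correct)
    return {b: hits[b] / totals[b] for b in sorted(totals)}


def max_drop(acc_by_bucket):
    """Largest accuracy drop from the shortest bucket to any other bucket."""
    buckets = sorted(acc_by_bucket)
    base = acc_by_bucket[buckets[0]]
    return max(base - acc_by_bucket[b] for b in buckets)
```

A model that challenges the reported degradation would show a `max_drop` close to zero while keeping high accuracy in every bucket.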

Original abstract

The evaluation of Long Video Understanding (LVU) performance poses an important but challenging research problem. Despite previous efforts, the existing video understanding benchmarks are severely constrained by several issues, especially the insufficient lengths of videos, a lack of diversity in video types and evaluation tasks, and the inappropriateness for evaluating LVU performances. To address the above problems, we propose a new benchmark called MLVU (Multi-task Long Video Understanding Benchmark) for the comprehensive and in-depth evaluation of LVU. MLVU presents the following critical values: 1) The substantial and flexible extension of video lengths, which enables the benchmark to evaluate LVU performance across a wide range of durations. 2) The inclusion of various video genres, e.g., movies, surveillance footage, egocentric videos, cartoons, game videos, etc., which reflects the models' LVU performances in different scenarios. 3) The development of diversified evaluation tasks, which enables a comprehensive examination of MLLMs' key abilities in long-video understanding. The empirical study with 23 latest MLLMs reveals significant room for improvement in today's technique, as all existing methods struggle with most of the evaluation tasks and exhibit severe performance degradation when handling longer videos. Additionally, it suggests that factors such as context length, image-understanding ability, and the choice of LLM backbone can play critical roles in future advancements. We anticipate that MLVU will advance the research of long video understanding by providing a comprehensive and in-depth analysis of MLLMs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces the MLVU benchmark for multi-task long video understanding, extending video lengths substantially, incorporating diverse genres (movies, surveillance, egocentric, cartoons, games), and defining multiple evaluation tasks to probe MLLM capabilities. It reports results from 23 recent MLLMs showing that all models struggle on most tasks and exhibit severe performance degradation as video duration increases, while suggesting that context length, image-understanding ability, and LLM backbone choice are key factors for future progress.

Significance. If the observed degradation can be shown to isolate long-range temporal reasoning rather than input-length artifacts, MLVU would supply a useful diagnostic benchmark that highlights concrete limitations in current MLLMs and could steer targeted improvements in context handling and temporal modeling.

major comments (2)
  1. [Abstract and §4] Abstract and §4 (empirical study): the central claim that models exhibit 'severe performance degradation when handling longer videos' is not supported by any description of frame sampling rates, visual-token budgets, or context-window management. Without holding the number of frames per minute and total visual tokens constant while varying only duration, the reported drop cannot be unambiguously attributed to LVU comprehension failures rather than input constraints.
  2. [§3] §3 (benchmark construction): task definitions, metric formulations, statistical significance controls, and video-exclusion criteria are not detailed. These omissions leave the cross-model and cross-duration comparisons only moderately supported and make it difficult to assess whether the chosen tasks genuinely probe long-range reasoning.
minor comments (1)
  1. [Abstract] Abstract: the phrase 'the inappropriateness for evaluating LVU performances' of prior benchmarks is stated without enumerating the specific shortcomings (e.g., length caps, task coverage) that MLVU is designed to remedy.
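The confound raised in major comment 1 can be illustrated with simple arithmetic. The sketch below contrasts two hypothetical sampling policies: a fixed sampling rate, where the visual-token count grows with duration, and a fixed frame count, where temporal density shrinks with duration. The function names and the numeric values used with them are placeholders chosen for illustration, not settings reported in the paper.

```python
def fixed_rate_plan(duration_sec, frames_per_min, tokens_per_frame):
    """Constant sampling rate: density is fixed, token count grows with duration."""
    minutes = duration_sec / 60
    n_frames = max(1, round(minutes * frames_per_min))
    return {
        "frames": n_frames,
        "frames_per_min": n_frames / minutes,
        "visual_tokens": n_frames * tokens_per_frame,
    }


def fixed_count_plan(duration_sec, n_frames, tokens_per_frame):
    """Constant frame count: token count is fixed, density shrinks with duration."""
    minutes = duration_sec / 60
    return {
        "frames": n_frames,
        "frames_per_min": n_frames / minutes,
        "visual_tokens": n_frames * tokens_per_frame,
    }
```

Under the fixed-count policy, a one-hour video is sampled six times more sparsely than a ten-minute one, so a drop in accuracy could reflect missing frames rather than weaker long-range reasoning. Under the fixed-rate policy, the longer video may exceed a model's context window instead. Either way, the sampling policy has to be reported before the drop can be interpreted.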

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. The comments highlight important areas where additional methodological transparency is needed to strengthen the claims about performance degradation and the validity of the benchmark tasks. We have revised the manuscript to address both major comments by expanding the relevant sections with the requested details on input processing and task construction. Point-by-point responses follow.

  1. Referee: [Abstract and §4] Abstract and §4 (empirical study): the central claim that models exhibit 'severe performance degradation when handling longer videos' is not supported by any description of frame sampling rates, visual-token budgets, or context-window management. Without holding the number of frames per minute and total visual tokens constant while varying only duration, the reported drop cannot be unambiguously attributed to LVU comprehension failures rather than input constraints.

    Authors: We agree that the original manuscript did not sufficiently document the input preprocessing pipeline, which limits the ability to isolate duration effects. In the revised version we have added a dedicated subsection in §4 describing the uniform frame-sampling strategy (fixed frames per minute across all durations), the per-model visual-token budget constraints, and the context-window truncation policy. With these controls held constant, the degradation trend remains statistically significant; we have included an additional figure and table that replot results under fixed token budgets to make this explicit. These changes directly address the concern and allow readers to attribute the drop more confidently to long-range reasoning limitations.

    Revision: yes

  2. Referee: [§3] §3 (benchmark construction): task definitions, metric formulations, statistical significance controls, and video-exclusion criteria are not detailed. These omissions leave the cross-model and cross-duration comparisons only moderately supported and make it difficult to assess whether the chosen tasks genuinely probe long-range reasoning.

    Authors: We acknowledge that §3 was too concise on these points. The revised manuscript now contains expanded subsections that (1) provide formal definitions and input-output formats for each task, (2) specify the exact metrics (accuracy, mean average precision, or normalized edit distance as appropriate), (3) describe the bootstrapping procedure used for statistical significance, and (4) list the explicit exclusion criteria applied during video curation (e.g., minimum duration, genre balance, and quality filters). These additions make the benchmark construction reproducible and clarify how each task targets long-range temporal dependencies rather than short-term cues.

    Revision: yes
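The rebuttal mentions a bootstrapping procedure without specifying it. A common choice for comparing accuracy on short and long videos is a percentile bootstrap over per-question outcomes. The sketch below is an assumption about what such a check could look like, not the authors' method; the function name, the number of resamples, and the seed are all illustrative.

```python
import random


def bootstrap_accuracy_gap(short_correct, long_correct,
                           n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for accuracy(short) - accuracy(long).

    Inputs are lists of 0/1 outcomes, one per question. Each group is
    resampled independently with replacement (an assumed design choice).
    """
    rng = random.Random(seed)
    gaps = []
    for _ in range(n_resamples):
        short_sample = rng.choices(short_correct, k=len(short_correct))
        long_sample = rng.choices(long_correct, k=len(long_correct))
        gaps.append(sum(short_sample) / len(short_sample)
                    - sum(long_sample) / len(long_sample))
    gaps.sort()
    lower = gaps[int((alpha / 2) * n_resamples)]
    upper = gaps[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper
```

If the resulting interval excludes zero, the accuracy gap between the two duration groups is unlikely to be a sampling artifact at the chosen level. It says nothing about whether the gap stems from reasoning limits or input constraints, which is the separate question raised in major comment 1.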

Circularity Check

0 steps flagged

No circularity: benchmark proposal with direct external model evaluations

The paper introduces the MLVU benchmark and reports direct empirical results from evaluating 23 third-party MLLMs on it. No derivations, equations, fitted parameters, or self-referential claims exist that reduce any reported result to a quantity defined by the authors' own inputs or prior work. The central claims rest on external model performance measurements rather than any internal construction or self-citation chain.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

This is an empirical benchmark proposal paper. It relies on the domain assumption that existing LVU benchmarks are inadequate due to length and diversity limits, with no free parameters, mathematical axioms, or new postulated entities introduced.

axioms (1)
  • domain assumption: Existing video understanding benchmarks are severely constrained by insufficient lengths, lack of diversity in video types and tasks, and inappropriateness for LVU evaluation.
    Directly stated in the abstract as the motivation for creating MLVU.

pith-pipeline@v0.9.0 · 5608 in / 1361 out tokens · 55765 ms · 2026-05-14T19:49:55.465952+00:00 · methodology


Forward citations

Cited by 22 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. CoRDS: Coreset-based Representative and Diverse Selection for Streaming Video Understanding

    cs.CV 2026-05 unverdicted novelty 7.0

    CoRDS selects a compact KV-cache subset via joint-space coreset coverage and log-det diversity to outperform token-wise heuristics on long-video VLM benchmarks.

  2. SYNCR: A Cross-Video Reasoning Benchmark with Synthetic Grounding

    cs.CV 2026-05 unverdicted novelty 7.0

    SYNCR benchmark shows leading MLLMs reach only 52.5% average accuracy on cross-video reasoning tasks against an 89.5% human baseline, with major weaknesses in physical and spatial reasoning.

  3. StoryTR: Narrative-Centric Video Temporal Retrieval with Theory of Mind Reasoning

    cs.AI 2026-04 unverdicted novelty 7.0

    StoryTR is a new benchmark and agentic data pipeline that adds explicit Theory of Mind reasoning chains to train smaller video retrieval models, yielding a 15% relative IoU gain over larger baselines on narrative content.

  4. Mosaic: Cross-Modal Clustering for Efficient Video Understanding

    cs.PF 2026-04 unverdicted novelty 7.0

    Mosaic uses cross-modal clusters as the unit for KVCache organization in VLMs to achieve up to 1.38x speedup in streaming long-video understanding.

  5. AdaSpark: Adaptive Sparsity for Efficient Long-Video Understanding

    cs.CV 2026-04 unverdicted novelty 7.0

    AdaSpark delivers up to 57% FLOP reduction in Video-LLMs for long videos through adaptive cube- and token-level sparsity without apparent loss in performance on hour-scale benchmarks.

  6. SVAgent: Storyline-Guided Long Video Understanding via Cross-Modal Multi-Agent Collaboration

    cs.CV 2026-04 unverdicted novelty 7.0

    SVAgent improves long video question answering by constructing storylines via multi-agent collaboration and aligning cross-modal predictions for more robust, human-like reasoning.

  7. Unified Reward Model for Multimodal Understanding and Generation

    cs.CV 2025-03 unverdicted novelty 7.0

    UnifiedReward is the first unified reward model that jointly assesses multimodal understanding and generation to provide better preference signals for aligning vision models via DPO.

  8. Video-MMMU: Evaluating Knowledge Acquisition from Multi-Discipline Professional Videos

    cs.CV 2025-01 unverdicted novelty 7.0

    Video-MMMU benchmark shows large multimodal models exhibit steep performance drops on higher cognitive tasks when learning from professional videos and lag significantly behind humans in knowledge acquisition.

  9. Video Active Perception: Effective Inference-Time Long-Form Video Understanding with Vision-Language Models

    cs.CV 2026-05 unverdicted novelty 6.0

    VAP is a training-free active-perception method that improves zero-shot long-form video QA performance and frame efficiency up to 5.6x in VLMs by selecting keyframes that differ from priors generated by a text-conditi...

  10. One Token per Highly Selective Frame: Towards Extreme Compression for Long Video Understanding

    cs.CV 2026-04 unverdicted novelty 6.0

    XComp reaches extreme video compression (one token per selective frame) via learnable progressive token compression and question-conditioned frame selection, lifting LVBench accuracy from 42.9 percent to 46.2 percent ...

  11. ForestPrune: High-ratio Visual Token Compression for Video Multimodal Large Language Models via Spatial-Temporal Forest Modeling

    cs.CV 2026-03 unverdicted novelty 6.0

    ForestPrune prunes 90% of visual tokens in video MLLMs like LLaVA-OneVision while retaining 95.8% accuracy by modeling tokens as spatial-temporal forests and scoring importance via tree depth and node roles.

  12. InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency

    cs.CV 2025-08 unverdicted novelty 6.0

    InternVL3.5 advances open-source multimodal models with Cascade RL for +16% reasoning gains and ViR for 4x inference speedup, with the 241B model reaching SOTA among open-source MLLMs on multimodal, reasoning, and age...

  13. InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models

    cs.CV 2025-04 conditional novelty 6.0

    InternVL3-78B sets a new open-source SOTA of 72.2 on MMMU via native joint multimodal pre-training, V2PE, MPO, and test-time scaling while remaining competitive with proprietary models.

  14. SmolVLM: Redefining small and efficient multimodal models

    cs.AI 2025-04 unverdicted novelty 6.0

    SmolVLM-256M outperforms a 300-times larger model using under 1 GB GPU memory, while the 2.2B version matches state-of-the-art VLMs at half the memory cost.

  15. LLaVA-Video: Video Instruction Tuning With Synthetic Data

    cs.CV 2024-10 unverdicted novelty 6.0

    LLaVA-Video-178K is a new synthetic video instruction dataset that, when combined with existing data to train LLaVA-Video, produces strong results on video understanding benchmarks.

  16. What Limits Vision-and-Language Navigation?

    cs.RO 2026-05 unverdicted novelty 5.0

    StereoNav reaches new benchmark highs on R2R-CE and RxR-CE and improves real-robot reliability by supplying persistent target-location priors and stereo-derived geometry that stay stable under lighting changes and blur.

  17. TTF: Temporal Token Fusion for Efficient Video-Language Model

    cs.CV 2026-05 unverdicted novelty 5.0

    TTF fuses temporally redundant visual tokens via local similarity search in a plug-and-play way, cutting ~67% tokens on Qwen3-VL-8B while retaining 99.5% accuracy with minimal overhead.

  18. LLaVA-OneVision: Easy Visual Task Transfer

    cs.CV 2024-08 unverdicted novelty 5.0

    LLaVA-OneVision is the first single open LMM to simultaneously achieve strong performance in single-image, multi-image, and video scenarios with cross-scenario transfer capabilities.

  19. EasyVideoR1: Easier RL for Video Understanding

    cs.CV 2026-04 unverdicted novelty 4.0

    EasyVideoR1 delivers an optimized RL pipeline for video understanding in large vision-language models, achieving 1.47x throughput gains and aligned results on 22 benchmarks.

  20. Seed1.5-VL Technical Report

    cs.CV 2025-05 unverdicted novelty 4.0

    Seed1.5-VL is a compact multimodal model that sets new records on dozens of vision-language benchmarks and outperforms prior systems on agent-style tasks.

  21. VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding

    cs.CV 2025-01 unverdicted novelty 4.0

    VideoLLaMA3 uses a vision-centric training paradigm and token-reduction design to reach competitive results on image and video benchmarks.

  22. VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs

    cs.CV 2024-06 unverdicted novelty 4.0

    VideoLLaMA 2 improves video LLMs via a new STC connector for spatial-temporal dynamics and joint audio training, reaching competitive results on video QA and captioning benchmarks.

Reference graph

Works this paper leans on

63 extracted references · 63 canonical work pages · cited by 22 Pith papers · 21 internal anchors

  1. [1]

    GPT-4 Technical Report

    Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023. 1, 6, 3

  2. [2]

    Claude 3

    Anthropic. Claude 3. https://www.anthropic.com/ news/claude-3-family, 2024. 7, 2

  3. [3]

    Minigpt4-video: Advancing multimodal llms for video understanding with interleaved visual-textual tokens

    Kirolos Ataallah, Xiaoqian Shen, Eslam Abdelrahman, Es- sam Sleiman, Deyao Zhu, Jian Ding, and Mohamed Elho- seiny. Minigpt4-video: Advancing multimodal llms for video understanding with interleaved visual-textual tokens. arXiv preprint arXiv:2404.03413, 2024. 6, 7, 2

  4. [4]

    Qwen Technical Report

    Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xi- aodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. Qwen technical report. arXiv preprint arXiv:2309.16609,

  5. [5]

    Frozen in time: A joint video and image encoder for end-to- end retrieval

    Max Bain, Arsha Nagrani, Gül Varol, and Andrew Zisserman. Frozen in time: A joint video and image encoder for end-to- end retrieval. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1728–1738, 2021. 4, 3

  6. [6]

    Language models are few-shot learners.Advances in neural information processing systems, 33:1877–1901, 2020

    Tom Brown, Benjamin Mann, Nick Ryder, Melanie Sub- biah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners.Advances in neural information processing systems, 33:1877–1901, 2020. 1

  7. [7]

    Sharegpt4video: Improving video understanding and generation with better captions, 2024

    Lin Chen, Xilin Wei, Jinsong Li, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Zehui Chen, Haodong Duan, Bin Lin, Zhenyu Tang, et al. Sharegpt4video: Improving video understand- ing and generation with better captions. arXiv preprint arXiv:2406.04325, 2024. 7, 2

  8. [8]

    How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites

    Zhe Chen, Weiyun Wang, Hao Tian, Shenglong Ye, Zhangwei Gao, Erfei Cui, Wenwen Tong, Kongzhi Hu, Jiapeng Luo, Zheng Ma, et al. How far are we to gpt-4v? closing the gap to commercial multimodal models with open-source suites. arXiv preprint arXiv:2404.16821, 2024. 2, 6, 7

  9. [9]

    VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs

    Zesen Cheng, Sicong Leng, Hang Zhang, Yifei Xin, Xin Li, Guanzheng Chen, Yongxin Zhu, Wenqi Zhang, Ziyang Luo, Deli Zhao, et al. Videollama 2: Advancing spatial- temporal modeling and audio understanding in video-llms. arXiv preprint arXiv:2406.07476, 2024. 6, 7, 2

  10. [10]

    Minedojo: Building open-ended embodied agents with internet-scale knowledge

    Linxi Fan, Guanzhi Wang, Yunfan Jiang, Ajay Mandlekar, Yuncong Yang, Haoyi Zhu, Andrew Tang, De-An Huang, Yuke Zhu, and Anima Anandkumar. Minedojo: Building open-ended embodied agents with internet-scale knowledge. Advances in Neural Information Processing Systems, 35: 18343–18362, 2022. 1

  11. [11]

    Video-ccam: Enhancing video-language un- derstanding with causal cross-attention masks for short and long videos.arXiv preprint arXiv:2408.14023, 2024

    Jiajun Fei, Dian Li, Zhidong Deng, Zekun Wang, Gang Liu, and Hui Wang. Video-ccam: Enhancing video-language un- derstanding with causal cross-attention masks for short and long videos. arXiv preprint arXiv:2408.14023, 2024. 2, 6, 7

  12. [12]

    MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models

    Chaoyou Fu, Peixian Chen, Yunhang Shen, Yulei Qin, Meng- dan Zhang, Xu Lin, Jinrui Yang, Xiawu Zheng, Ke Li, Xing Sun, Yunsheng Wu, and Rongrong Ji. Mme: A comprehen- sive evaluation benchmark for multimodal large language models. arXiv preprint arXiv:2306.13394, 2023. 1, 3

  13. [13]

    Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis

    Chaoyou Fu, Yuhan Dai, Yondong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, Chenyu Zhou, Yunhang Shen, Mengdan Zhang, et al. Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis. arXiv preprint arXiv:2405.21075, 2024. 1, 2

  14. [14]

    Data engineering for scaling language models to 128K context.arXiv preprint arXiv:2402.10171, 2024

    Yao Fu, Rameswar Panda, Xinyao Niu, Xiang Yue, Hannaneh Hajishirzi, Yoon Kim, and Hao Peng. Data engineering for scaling language models to 128k context. arXiv preprint arXiv:2402.10171, 2024. 6

  15. [15]

    Ego4d: Around the world in 3,000 hours of egocentric video

    Kristen Grauman, Andrew Westbury, Eugene Byrne, Zachary Chavis, Antonino Furnari, Rohit Girdhar, Jackson Hamburger, Hao Jiang, Miao Liu, Xingyu Liu, et al. Ego4d: Around the world in 3,000 hours of egocentric video. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18995–19012, 2022. 1, 3, 4

  16. [16]

    Ma-lmm: Memory-augmented large multimodal model for long-term video understanding.arXiv preprint arXiv:2404.05726, 2024

    Bo He, Hengduo Li, Young Kyun Jang, Menglin Jia, Xue- fei Cao, Ashish Shah, Abhinav Shrivastava, and Ser-Nam Lim. Ma-lmm: Memory-augmented large multimodal model for long-term video understanding. arXiv preprint arXiv:2404.05726, 2024. 3, 7, 2

  17. [17]

    Movienet: A holistic dataset for movie understanding

    Qingqiu Huang, Yu Xiong, Anyi Rao, Jiaze Wang, and Dahua Lin. Movienet: A holistic dataset for movie understanding. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part IV 16, pages 709–727. Springer, 2020. 2, 3

  18. [18]

    The Kinetics Human Action Video Dataset

    Will Kay, Joao Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, Sudheendra Vijayanarasimhan, Fabio Viola, Tim Green, Trevor Back, Paul Natsev, et al. The kinetics hu- man action video dataset. arXiv preprint arXiv:1705.06950,

  19. [19]

    Complex video rea- soning and robustness evaluation suite for video-lmms.arXiv preprint arXiv:2405.03690, 2024

    Muhammad Uzair Khattak, Muhammad Ferjad Naeem, Jameel Hassan, Muzammal Naseer, Federico Tombari, Fa- had Shahbaz Khan, and Salman Khan. Complex video rea- soning and robustness evaluation suite for video-lmms. arXiv preprint arXiv:2405.03690, 2024. 1

  20. [20]

    Dense-captioning events in videos

    Ranjay Krishna, Kenji Hata, Frederic Ren, Li Fei-Fei, and Juan Carlos Niebles. Dense-captioning events in videos. In Proceedings of the IEEE international conference on computer vision, pages 706–715, 2017. 6

  21. [21]

    TVQA: Localized, Compositional Video Question Answering

    Jie Lei, Licheng Yu, Mohit Bansal, and Tamara L Berg. Tvqa: Localized, compositional video question answering. arXiv preprint arXiv:1809.01696, 2018. 2, 4

  22. [22]

    SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension

    Bohao Li, Rui Wang, Guangzhi Wang, Yuying Ge, Yix- iao Ge, and Ying Shan. Seed-bench: Benchmarking mul- timodal llms with generative comprehension. arXiv preprint arXiv:2307.16125, 2023. 1

  23. [23]

    Otter: A Multi-Modal Model with In-Context Instruction Tuning

    Bo Li, Yuanhan Zhang, Liangyu Chen, Jinghao Wang, Jingkang Yang, and Ziwei Liu. Otter: A multi-modal model 9 with in-context instruction tuning. CoRR, abs/2305.03726,

  24. [24]

    LLaVA-OneVision: Easy Visual Task Transfer

    Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Yanwei Li, Ziwei Liu, and Chun- yuan Li. Llava-onevision: Easy visual task transfer. arXiv preprint arXiv:2408.03326, 2024. 6, 7

  25. [25]

    VideoChat: Chat-Centric Video Understanding

    KunChang Li, Yinan He, Yi Wang, Yizhuo Li, Wenhai Wang, Ping Luo, Yali Wang, Limin Wang, and Yu Qiao. Videochat: Chat-centric video understanding. arXiv preprint arXiv:2305.06355, 2023. 2, 7

  26. [26]

    Mvbench: A comprehensive multi-modal video understand- ing benchmark.arXiv preprint arXiv:2311.17005, 2023

    Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, Yi Liu, Zun Wang, Jilan Xu, Guo Chen, Ping Luo, et al. Mvbench: A comprehensive multi-modal video understand- ing benchmark. arXiv preprint arXiv:2311.17005, 2023. 1, 2, 3, 7

  27. [27]

    Llama-vid: An image is worth 2 tokens in large language models.arXiv preprint arXiv:2311.17043, 2023

    Yanwei Li, Chengyao Wang, and Jiaya Jia. Llama-vid: An image is worth 2 tokens in large language models. arXiv preprint arXiv:2311.17043, 2023. 1, 2, 3, 7

  28. [28]

    Video-LLaVA: Learning United Visual Representation by Alignment Before Projection

    Bin Lin, Bin Zhu, Yang Ye, Munan Ning, Peng Jin, and Li Yuan. Video-llava: Learning united visual represen- tation by alignment before projection. arXiv preprint arXiv:2311.10122, 2023. 2, 7

  29. [29]

    Visual instruction tuning

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. Advances in neural information processing systems, 36, 2023. 2, 7

  30. [30]

    World model on million-length video and language with blockwise ringattention

    Hao Liu, Wilson Yan, Matei Zaharia, and Pieter Abbeel. World model on million-length video and language with ringattention. arXiv preprint arXiv:2402.08268, 2024. 3

  31. [31]

    Lost in the middle: How language models use long con- texts

    Nelson F Liu, Kevin Lin, John Hewitt, Ashwin Paran- jape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. Lost in the middle: How language models use long con- texts. Transactions of the Association for Computational Linguistics, 12:157–173, 2024. 4

  32. [32]

    MMBench: Is Your Multi-modal Model an All-around Player?

    Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, et al. Mmbench: Is your multi-modal model an all-around player? arXiv preprint arXiv:2307.06281, 2023. 1

  33. [33]

    Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models

    Muhammad Maaz, Hanoona Rasheed, Salman Khan, and Fahad Shahbaz Khan. Video-chatgpt: Towards detailed video understanding via large vision and language models. arXiv preprint arXiv:2306.05424, 2023. 2

  34. [34]

    Egoschema: A diagnostic benchmark for very long- form video language understanding

    Karttikeya Mangalam, Raiymbek Akshulakov, and Jitendra Malik. Egoschema: A diagnostic benchmark for very long- form video language understanding. Advances in Neural Information Processing Systems, 36, 2023. 1, 2, 3

  35. [35]

    Howto100m: Learning a text-video embedding by watching hundred mil- lion narrated video clips

    Antoine Miech, Dimitri Zhukov, Jean-Baptiste Alayrac, Makarand Tapaswi, Ivan Laptev, and Josef Sivic. Howto100m: Learning a text-video embedding by watching hundred mil- lion narrated video clips. In Proceedings of the IEEE/CVF international conference on computer vision, pages 2630– 2640, 2019. 3

  36. [36]

    Video-bench: A comprehen- sive benchmark and toolkit for evaluating video-based large language models

    Munan Ning, Bin Zhu, Yujia Xie, Bin Lin, Jiaxi Cui, Lu Yuan, Dongdong Chen, and Li Yuan. Video-bench: A comprehen- sive benchmark and toolkit for evaluating video-based large language models. arXiv preprint arXiv:2311.16103, 2023. 1

  37. [37]

    OpenAI. Gpt-4o. https://openai.com/index/ hello-gpt-4o/, 2024. 6, 7, 2

  38. [38]

    Retrieving-to-answer: Zero-shot video question answering with frozen large lan- guage models

    Junting Pan, Ziyi Lin, Yuying Ge, Xiatian Zhu, Renrui Zhang, Yi Wang, Yu Qiao, and Hongsheng Li. Retrieving-to-answer: Zero-shot video question answering with frozen large lan- guage models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 272–283, 2023. 3

  39. [39]

    Timechat: A time-sensitive multimodal large lan- guage model for long video understanding

    Shuhuai Ren, Linli Yao, Shicheng Li, Xu Sun, and Lu Hou. Timechat: A time-sensitive multimodal large lan- guage model for long video understanding. arXiv preprint arXiv:2312.02051, 2023. 2

  40. [40]

    Video-xl: Extra-long vision language model for hour-scale video understanding

    Yan Shu, Peitian Zhang, Zheng Liu, Minghao Qin, Junjie Zhou, Tiejun Huang, and Bo Zhao. Video-xl: Extra-long vision language model for hour-scale video understanding. arXiv preprint arXiv:2409.14485, 2024. 2, 3, 6, 7

  41. [41]

    Moviechat: From dense to- ken to sparse memory for long video understanding.arXiv preprint arXiv:2307.16449, 2023

    Enxin Song, Wenhao Chai, Guanhong Wang, Yucheng Zhang, Haoyang Zhou, Feiyang Wu, Xun Guo, Tian Ye, Yan Lu, Jenq-Neng Hwang, et al. Moviechat: From dense token to sparse memory for long video understanding. arXiv preprint arXiv:2307.16449, 2023. 1, 2, 3, 7, 4

  42. [42]

    Moviellm: Enhancing long video understanding with ai-generated movies

    Zhende Song, Chenchen Wang, Jiamu Sheng, Chi Zhang, Gang Yu, Jiayuan Fan, and Tao Chen. Moviellm: Enhancing long video understanding with ai-generated movies. arXiv preprint arXiv:2403.01422, 2024. 7, 2

  43. [43]

    Real-world anomaly detection in surveillance videos

    Waqas Sultani, Chen Chen, and Mubarak Shah. Real-world anomaly detection in surveillance videos. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 6479–6488, 2018.

  44. [44]

    Gemini: A Family of Highly Capable Multimodal Models

    Gemini Team, Rohan Anil, Sebastian Borgeaud, Yonghui Wu, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, et al. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023.

  45. [45]

    Llama 2: Open Foundation and Fine-Tuned Chat Models

    Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.

  46. [46]

    Videoagent: Long-form video understanding with large language model as agent

    Xiaohan Wang, Yuhui Zhang, Orr Zohar, and Serena Yeung-Levy. Videoagent: Long-form video understanding with large language model as agent. arXiv preprint arXiv:2403.10517, 2024.

  47. [47]

    Paxion: Patching action knowledge in video-language foundation models

    Zhenhailong Wang, Ansel Blume, Sha Li, Genglin Liu, Jaemin Cho, Zineng Tang, Mohit Bansal, and Heng Ji. Paxion: Patching action knowledge in video-language foundation models. Advances in Neural Information Processing Systems, 36, 2023.

  48. [48]

    Star: A benchmark for situated reasoning in real-world videos

    Bo Wu, Shoubin Yu, Zhenfang Chen, Joshua B Tenenbaum, and Chuang Gan. Star: A benchmark for situated reasoning in real-world videos. In Thirty-fifth conference on neural information processing systems datasets and benchmarks track (Round 2), 2021.

  49. [49]

    Longvideobench: A benchmark for long-context interleaved video-language understanding

    Haoning Wu, Dongxu Li, Bei Chen, and Junnan Li. Longvideobench: A benchmark for long-context interleaved video-language understanding. arXiv preprint arXiv:2407.15754, 2024.

  50. [50]

    Next-qa: Next phase of question-answering to explaining temporal actions

    Junbin Xiao, Xindi Shang, Angela Yao, and Tat-Seng Chua. Next-qa: Next phase of question-answering to explaining temporal actions. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9777–9786, 2021.

  51. [51]

    Funqa: Towards surprising video comprehension

    Binzhu Xie, Sicheng Zhang, Zitang Zhou, Bo Li, Yuanhan Zhang, Jack Hessel, Jingkang Yang, and Ziwei Liu. Funqa: Towards surprising video comprehension. arXiv preprint arXiv:2306.14899, 2023.

  52. [52]

    Msr-vtt: A large video description dataset for bridging video and language

    Jun Xu, Tao Mei, Ting Yao, and Yong Rui. Msr-vtt: A large video description dataset for bridging video and language. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5288–5296, 2016.

  53. [53]

    Retrieval-based video language model for efficient long video question answering

    Jiaqi Xu, Cuiling Lan, Wenxuan Xie, Xuejin Chen, and Yan Lu. Retrieval-based video language model for efficient long video question answering. arXiv preprint arXiv:2312.04931, 2023.

  54. [54]

    mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality

    Qinghao Ye, Haiyang Xu, Guohai Xu, Jiabo Ye, Ming Yan, Yiyang Zhou, Junyang Wang, Anwen Hu, Pengcheng Shi, Yaya Shi, et al. mplug-owl: Modularization empowers large language models with multimodality. arXiv preprint arXiv:2304.14178, 2023.

  55. [55]

    Clevrer: Collision events for video representation and reasoning

    Kexin Yi, Chuang Gan, Yunzhu Li, Pushmeet Kohli, Jiajun Wu, Antonio Torralba, and Joshua B Tenenbaum. Clevrer: Collision events for video representation and reasoning. In International Conference on Learning Representations, 2019.

  56. [56]

    Activitynet-qa: A dataset for understanding complex web videos via question answering

    Zhou Yu, Dejing Xu, Jun Yu, Ting Yu, Zhou Zhao, Yueting Zhuang, and Dacheng Tao. Activitynet-qa: A dataset for understanding complex web videos via question answering. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 9127–9134, 2019.

  57. [57]

    MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI

    Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, et al. Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi. arXiv preprint arXiv:2311.16502, 2023.

  58. [58]

    Movie101: A new movie understanding benchmark

    Zihao Yue, Qi Zhang, Anwen Hu, Liang Zhang, Ziheng Wang, and Qin Jin. Movie101: A new movie understanding benchmark. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 4669–4684, 2023.

  59. [59]

    Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding

    Hang Zhang, Xin Li, and Lidong Bing. Video-llama: An instruction-tuned audio-visual language model for video understanding. arXiv preprint arXiv:2306.02858, 2023.

  60. [60]

    Long Context Transfer from Language to Vision

    Peiyuan Zhang, Kaichen Zhang, Bo Li, Guangtao Zeng, Jingkang Yang, Yuanhan Zhang, Ziyue Wang, Haoran Tan, Chunyuan Li, and Ziwei Liu. Long context transfer from language to vision. arXiv preprint arXiv:2406.16852, 2024.

  61. [61]

    ∞ bench: Extending long context evaluation beyond 100k tokens

    Xinrong Zhang, Yingfa Chen, Shengding Hu, Zihang Xu, Junhao Chen, Moo Khai Hao, Xu Han, Zhen Leng Thai, Shuo Wang, Zhiyuan Liu, et al. ∞ bench: Extending long context evaluation beyond 100k tokens. arXiv preprint arXiv:2402.13718, 2024.

  62. [62]

    Languagebind: Extending video-language pretraining to n-modality by language-based semantic alignment

    Bin Zhu, Bin Lin, Munan Ning, Yang Yan, Jiaxi Cui, HongFa Wang, Yatian Pang, Wenhao Jiang, Junwu Zhang, Zongwei Li, et al. Languagebind: Extending video-language pretraining to n-modality by language-based semantic alignment. arXiv preprint arXiv:2310.01852, 2023.

  63. [63]

    Minigpt-4: Enhancing vision-language understanding with advanced large language models

    Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. Minigpt-4: Enhancing vision-language understanding with advanced large language models. In The Twelfth International Conference on Learning Representations, 2023.

    MLVU: Benchmarking Multi-task Long Video Understanding

    Supplementary Material

    A. Overview of Appendix

    • B: Evaluation...