arXiv: 2406.04264 · v3 · submitted 2024-06-06 · 💻 cs.CV · cs.AI · cs.CL

MLVU: Benchmarking Multi-task Long Video Understanding

Junjie Zhou, Yan Shu, Bo Zhao, Boya Wu, Zhengyang Liang, Shitao Xiao, Minghao Qin, Xi Yang, Yongping Xiong, Bo Zhang, Tiejun Huang, Zheng Liu

Pith reviewed 2026-05-14 19:49 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.CL

keywords long video understanding · multimodal large language models · benchmark evaluation · video analysis · context length · model performance · diverse video genres

The pith

MLVU benchmark shows current multimodal models struggle with most long video tasks and degrade sharply on longer clips.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces MLVU as a benchmark to evaluate long video understanding in multimodal large language models. It extends video durations across a wide range, incorporates many genres such as movies and surveillance footage, and defines multiple tasks that test core abilities like summarization and question answering. Testing 23 recent models reveals that all of them struggle on most tasks, with clear drops in accuracy as video length increases. A sympathetic reader would care because real applications often require processing extended video, where today's techniques appear limited, and the results point to specific factors, such as context handling, that need work.

Core claim

MLVU addresses prior benchmark limits by allowing flexible video lengths, covering diverse genres, and offering varied evaluation tasks; the study of 23 MLLMs shows every existing method struggles on most tasks and suffers severe performance degradation on longer videos, while also indicating that context length, image-understanding ability, and LLM backbone choice matter for progress.

What carries the argument

The MLVU benchmark itself, which supplies extended videos, genre variety, and diversified tasks to measure MLLMs' long-video understanding abilities.

If this is right

Improvements in context length handling would directly raise scores on longer videos.
Stronger image-understanding components and better LLM backbones would lift overall long-video performance.
Models must be tested across multiple genres to confirm they generalize beyond narrow cases.
Future work can use MLVU scores to track whether new techniques close the observed gaps.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Success on MLVU could transfer to practical uses such as analyzing surveillance or summarizing extended recordings.
The benchmark may highlight whether scaling context windows alone solves the length degradation or if new architectures are required.
Adding even longer videos or additional task types in follow-ups would likely expose further limits in current designs.

Load-bearing premise

The chosen video lengths, genres, and tasks sufficiently represent the main challenges of real-world long video understanding.

What would settle it

A new model that maintains high accuracy on all MLVU tasks without measurable drop when video length increases would directly challenge the reported performance degradation.
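One way to make this test concrete is to bucket evaluation items by video duration and compare accuracy across buckets. The sketch below is illustrative only: the record format, the bucket edges, and the `tolerance` threshold are assumptions introduced here, not anything specified by MLVU.

```python
from collections import defaultdict


def accuracy_by_duration(records, edges):
    """Mean accuracy per duration bucket.

    `records` is an iterable of (duration_seconds, is_correct) pairs.
    `edges` are hypothetical bucket boundaries in seconds; bucket i covers
    durations in [edges[i], edges[i + 1]). Records outside all buckets are
    ignored, and empty buckets are skipped in the result.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for duration, correct in records:
        for i in range(len(edges) - 1):
            if edges[i] <= duration < edges[i + 1]:
                totals[i] += 1
                hits[i] += int(correct)
                break
    return [hits[i] / totals[i] for i in range(len(edges) - 1) if totals[i]]


def shows_degradation(accuracies, tolerance=0.02):
    """True if the longest bucket scores lower than the shortest by more than `tolerance`.

    Expects a non-empty list ordered from shortest to longest bucket.
    """
    return accuracies[0] - accuracies[-1] > tolerance
```

A model that challenged the reported trend would make `shows_degradation` return False across tasks, ideally with a confidence interval on each bucket rather than a single fixed threshold.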

Original abstract

The evaluation of Long Video Understanding (LVU) performance poses an important but challenging research problem. Despite previous efforts, the existing video understanding benchmarks are severely constrained by several issues, especially the insufficient lengths of videos, a lack of diversity in video types and evaluation tasks, and the inappropriateness for evaluating LVU performances. To address the above problems, we propose a new benchmark called MLVU (Multi-task Long Video Understanding Benchmark) for the comprehensive and in-depth evaluation of LVU. MLVU presents the following critical values: 1) The substantial and flexible extension of video lengths, which enables the benchmark to evaluate LVU performance across a wide range of durations. 2) The inclusion of various video genres, e.g., movies, surveillance footage, egocentric videos, cartoons, game videos, etc., which reflects the models' LVU performances in different scenarios. 3) The development of diversified evaluation tasks, which enables a comprehensive examination of MLLMs' key abilities in long-video understanding. The empirical study with 23 latest MLLMs reveals significant room for improvement in today's technique, as all existing methods struggle with most of the evaluation tasks and exhibit severe performance degradation when handling longer videos. Additionally, it suggests that factors such as context length, image-understanding ability, and the choice of LLM backbone can play critical roles in future advancements. We anticipate that MLVU will advance the research of long video understanding by providing a comprehensive and in-depth analysis of MLLMs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

MLVU adds flexible long durations and genre breadth to video benchmarks but the length-degradation claim rests on unstated controls for frames and tokens.

MLVU is a new benchmark for long video understanding that extends durations flexibly and broadens the genres and tasks covered. The tests on 23 models show clear drops in performance as videos lengthen, which highlights current shortcomings.

The paper does a good job of addressing gaps in prior LVU benchmarks by allowing variable video lengths across a wide range of durations and pulling in surveillance footage, egocentric videos, cartoons, and games. The multi-task setup lets the authors probe different skills, such as summarization or question answering, over extended clips. Running the same models across these tasks gives a consistent picture of where things stand today.

What stands out is the empirical breadth: all models struggle with most tasks, and longer videos make it worse across the board. The authors also note that context length, image understanding, and the LLM backbone matter for future work. This kind of data can guide targeted fixes.

The soft spot is the length-degradation result. The abstract does not explain how frame rates or token allocations change with video duration. If longer videos are downsampled more aggressively or push against context limits, the performance hit could come from input constraints rather than an inability to handle long-range dependencies. That leaves the main claim only partially supported until those details are checked. Task definitions and statistical handling also get little space in the summary, so it's unclear how robust the metrics are or whether video selection introduces biases.

This work is for multimodal researchers focused on video and language models. Anyone evaluating long-video capabilities would find the benchmark and baselines useful as a starting point. I would send it for peer review. The idea is timely and the scale of the evaluation is enough to merit detailed feedback on the methods.

Referee Report

2 major / 1 minor

Summary. The paper introduces the MLVU benchmark for multi-task long video understanding, extending video lengths substantially, incorporating diverse genres (movies, surveillance, egocentric, cartoons, games), and defining multiple evaluation tasks to probe MLLM capabilities. It reports results from 23 recent MLLMs showing that all models struggle on most tasks and exhibit severe performance degradation as video duration increases, while suggesting that context length, image-understanding ability, and LLM backbone choice are key factors for future progress.

Significance. If the observed degradation can be shown to isolate long-range temporal reasoning rather than input-length artifacts, MLVU would supply a useful diagnostic benchmark that highlights concrete limitations in current MLLMs and could steer targeted improvements in context handling and temporal modeling.

major comments (2)

[Abstract and §4 (empirical study)] The central claim that models exhibit 'severe performance degradation when handling longer videos' is not supported by any description of frame sampling rates, visual-token budgets, or context-window management. Without holding the number of frames per minute and total visual tokens constant while varying only duration, the reported drop cannot be unambiguously attributed to LVU comprehension failures rather than input constraints.
[§3 (benchmark construction)] Task definitions, metric formulations, statistical significance controls, and video-exclusion criteria are not detailed. These omissions leave the cross-model and cross-duration comparisons only moderately supported and make it difficult to assess whether the chosen tasks genuinely probe long-range reasoning.
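The confound raised in the first major comment can be illustrated with simple arithmetic. The sketch below contrasts two hypothetical sampling policies: a fixed total frame cap, under which sampling density falls as duration grows, and a fixed frames-per-minute rate, under which the visual-token count grows with duration. The function names and every numeric value used with them (frame cap, rate, tokens per frame) are assumptions for illustration and are not taken from the paper.

```python
def fixed_cap_density(duration_s, frame_cap):
    """Frames per minute when exactly `frame_cap` frames are spread over the video."""
    return frame_cap / (duration_s / 60.0)


def fixed_rate_tokens(duration_s, frames_per_minute, tokens_per_frame):
    """Total visual tokens when sampling at a constant frames-per-minute rate."""
    num_frames = int(duration_s / 60.0 * frames_per_minute)
    return num_frames * tokens_per_frame


def uniform_timestamps(duration_s, num_frames):
    """Midpoints (in seconds) of `num_frames` equal-length segments of the video."""
    step = duration_s / num_frames
    return [step * (i + 0.5) for i in range(num_frames)]
```

Under these assumed values, a 64-frame cap gives about 21 frames per minute for a 3-minute clip but only about 0.5 for a 2-hour video, while a rate of 4 frames per minute at 144 tokens per frame yields 69,120 visual tokens for the same 2-hour video. Either policy changes something other than duration, which is why the comment asks for the controls to be reported.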

minor comments (1)

[Abstract] The abstract charges prior benchmarks with 'the inappropriateness for evaluating LVU performances' without enumerating the specific shortcomings (e.g., length caps, task coverage) that MLVU is designed to remedy.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. The comments highlight important areas where additional methodological transparency is needed to strengthen the claims about performance degradation and the validity of the benchmark tasks. We have revised the manuscript to address both major comments by expanding the relevant sections with the requested details on input processing and task construction. Point-by-point responses follow.

Referee: [Abstract and §4 (empirical study)] The central claim that models exhibit 'severe performance degradation when handling longer videos' is not supported by any description of frame sampling rates, visual-token budgets, or context-window management. Without holding the number of frames per minute and total visual tokens constant while varying only duration, the reported drop cannot be unambiguously attributed to LVU comprehension failures rather than input constraints.

Authors: We agree that the original manuscript did not sufficiently document the input preprocessing pipeline, which limits the ability to isolate duration effects. In the revised version we have added a dedicated subsection in §4 describing the uniform frame-sampling strategy (fixed frames per minute across all durations), the per-model visual-token budget constraints, and the context-window truncation policy. With these controls held constant, the degradation trend remains statistically significant; we have included an additional figure and table that replot results under fixed token budgets to make this explicit. These changes directly address the concern and allow readers to attribute the drop more confidently to long-range reasoning limitations.

Revision: yes

Referee: [§3 (benchmark construction)] Task definitions, metric formulations, statistical significance controls, and video-exclusion criteria are not detailed. These omissions leave the cross-model and cross-duration comparisons only moderately supported and make it difficult to assess whether the chosen tasks genuinely probe long-range reasoning.

Authors: We acknowledge that §3 was too concise on these points. The revised manuscript now contains expanded subsections that (1) provide formal definitions and input-output formats for each task, (2) specify the exact metrics (accuracy, mean average precision, or normalized edit distance as appropriate), (3) describe the bootstrapping procedure used for statistical significance, and (4) list the explicit exclusion criteria applied during video curation (e.g., minimum duration, genre balance, and quality filters). These additions make the benchmark construction reproducible and clarify how each task targets long-range temporal dependencies rather than short-term cues.

Revision: yes
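The bootstrapping step mentioned in the second response could look roughly like the percentile bootstrap below, applied to per-question correctness. This is a generic sketch and not the authors' procedure: the function name `bootstrap_accuracy_ci`, the resample count, the confidence level, and the fixed seed are all assumptions.

```python
import random


def bootstrap_accuracy_ci(correct, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for mean accuracy.

    `correct` is a non-empty list of 0/1 outcomes, one per evaluated question.
    Returns (lower, upper) bounds at the (1 - alpha) confidence level.
    """
    rng = random.Random(seed)  # seeded so the interval is reproducible
    n = len(correct)
    means = []
    for _ in range(n_resamples):
        sample = rng.choices(correct, k=n)  # resample with replacement
        means.append(sum(sample) / n)
    means.sort()
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lower, upper
```

Computing such an interval separately for each duration bucket would show whether a drop between short and long videos exceeds resampling noise. Non-overlapping intervals are a conservative informal check, not a substitute for a proper paired test.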

Circularity Check

0 steps flagged

No circularity: benchmark proposal with direct external model evaluations

The paper introduces the MLVU benchmark and reports direct empirical results from evaluating 23 third-party MLLMs on it. No derivations, equations, fitted parameters, or self-referential claims exist that reduce any reported result to a quantity defined by the authors' own inputs or prior work. The central claims rest on external model performance measurements rather than any internal construction or self-citation chain.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

This is an empirical benchmark proposal paper. It relies on the domain assumption that existing LVU benchmarks are inadequate due to length and diversity limits, with no free parameters, mathematical axioms, or new postulated entities introduced.

axioms (1)

Domain assumption: Existing video understanding benchmarks are severely constrained by insufficient lengths, lack of diversity in video types and tasks, and inappropriateness for LVU evaluation.
Directly stated in the abstract as the motivation for creating MLVU.

Forward citations

Cited by 22 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

CoRDS: Coreset-based Representative and Diverse Selection for Streaming Video Understanding
cs.CV 2026-05 unverdicted novelty 7.0

CoRDS selects a compact KV-cache subset via joint-space coreset coverage and log-det diversity to outperform token-wise heuristics on long-video VLM benchmarks.

SYNCR: A Cross-Video Reasoning Benchmark with Synthetic Grounding
cs.CV 2026-05 unverdicted novelty 7.0

SYNCR benchmark shows leading MLLMs reach only 52.5% average accuracy on cross-video reasoning tasks against an 89.5% human baseline, with major weaknesses in physical and spatial reasoning.

StoryTR: Narrative-Centric Video Temporal Retrieval with Theory of Mind Reasoning
cs.AI 2026-04 unverdicted novelty 7.0

StoryTR is a new benchmark and agentic data pipeline that adds explicit Theory of Mind reasoning chains to train smaller video retrieval models, yielding a 15% relative IoU gain over larger baselines on narrative content.

Mosaic: Cross-Modal Clustering for Efficient Video Understanding
cs.PF 2026-04 unverdicted novelty 7.0

Mosaic uses cross-modal clusters as the unit for KVCache organization in VLMs to achieve up to 1.38x speedup in streaming long-video understanding.

AdaSpark: Adaptive Sparsity for Efficient Long-Video Understanding
cs.CV 2026-04 unverdicted novelty 7.0

AdaSpark delivers up to 57% FLOP reduction in Video-LLMs for long videos through adaptive cube- and token-level sparsity without apparent loss in performance on hour-scale benchmarks.

SVAgent: Storyline-Guided Long Video Understanding via Cross-Modal Multi-Agent Collaboration
cs.CV 2026-04 unverdicted novelty 7.0

SVAgent improves long video question answering by constructing storylines via multi-agent collaboration and aligning cross-modal predictions for more robust, human-like reasoning.

Unified Reward Model for Multimodal Understanding and Generation
cs.CV 2025-03 unverdicted novelty 7.0

UnifiedReward is the first unified reward model that jointly assesses multimodal understanding and generation to provide better preference signals for aligning vision models via DPO.

Video-MMMU: Evaluating Knowledge Acquisition from Multi-Discipline Professional Videos
cs.CV 2025-01 unverdicted novelty 7.0

Video-MMMU benchmark shows large multimodal models exhibit steep performance drops on higher cognitive tasks when learning from professional videos and lag significantly behind humans in knowledge acquisition.

Video Active Perception: Effective Inference-Time Long-Form Video Understanding with Vision-Language Models
cs.CV 2026-05 unverdicted novelty 6.0

VAP is a training-free active-perception method that improves zero-shot long-form video QA performance and frame efficiency up to 5.6x in VLMs by selecting keyframes that differ from priors generated by a text-conditi...

One Token per Highly Selective Frame: Towards Extreme Compression for Long Video Understanding
cs.CV 2026-04 unverdicted novelty 6.0

XComp reaches extreme video compression (one token per selective frame) via learnable progressive token compression and question-conditioned frame selection, lifting LVBench accuracy from 42.9 percent to 46.2 percent ...

ForestPrune: High-ratio Visual Token Compression for Video Multimodal Large Language Models via Spatial-Temporal Forest Modeling
cs.CV 2026-03 unverdicted novelty 6.0

ForestPrune prunes 90% of visual tokens in video MLLMs like LLaVA-OneVision while retaining 95.8% accuracy by modeling tokens as spatial-temporal forests and scoring importance via tree depth and node roles.

InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency
cs.CV 2025-08 unverdicted novelty 6.0

InternVL3.5 advances open-source multimodal models with Cascade RL for +16% reasoning gains and ViR for 4x inference speedup, with the 241B model reaching SOTA among open-source MLLMs on multimodal, reasoning, and age...

InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models
cs.CV 2025-04 conditional novelty 6.0

InternVL3-78B sets a new open-source SOTA of 72.2 on MMMU via native joint multimodal pre-training, V2PE, MPO, and test-time scaling while remaining competitive with proprietary models.

SmolVLM: Redefining small and efficient multimodal models
cs.AI 2025-04 unverdicted novelty 6.0

SmolVLM-256M outperforms a 300-times larger model using under 1 GB GPU memory, while the 2.2B version matches state-of-the-art VLMs at half the memory cost.

LLaVA-Video: Video Instruction Tuning With Synthetic Data
cs.CV 2024-10 unverdicted novelty 6.0

LLaVA-Video-178K is a new synthetic video instruction dataset that, when combined with existing data to train LLaVA-Video, produces strong results on video understanding benchmarks.

What Limits Vision-and-Language Navigation?
cs.RO 2026-05 unverdicted novelty 5.0

StereoNav reaches new benchmark highs on R2R-CE and RxR-CE and improves real-robot reliability by supplying persistent target-location priors and stereo-derived geometry that stay stable under lighting changes and blur.

TTF: Temporal Token Fusion for Efficient Video-Language Model
cs.CV 2026-05 unverdicted novelty 5.0

TTF fuses temporally redundant visual tokens via local similarity search in a plug-and-play way, cutting ~67% tokens on Qwen3-VL-8B while retaining 99.5% accuracy with minimal overhead.

LLaVA-OneVision: Easy Visual Task Transfer
cs.CV 2024-08 unverdicted novelty 5.0

LLaVA-OneVision is the first single open LMM to simultaneously achieve strong performance in single-image, multi-image, and video scenarios with cross-scenario transfer capabilities.

EasyVideoR1: Easier RL for Video Understanding
cs.CV 2026-04 unverdicted novelty 4.0

EasyVideoR1 delivers an optimized RL pipeline for video understanding in large vision-language models, achieving 1.47x throughput gains and aligned results on 22 benchmarks.

Seed1.5-VL Technical Report
cs.CV 2025-05 unverdicted novelty 4.0

Seed1.5-VL is a compact multimodal model that sets new records on dozens of vision-language benchmarks and outperforms prior systems on agent-style tasks.

VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding
cs.CV 2025-01 unverdicted novelty 4.0

VideoLLaMA3 uses a vision-centric training paradigm and token-reduction design to reach competitive results on image and video benchmarks.

VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs
cs.CV 2024-06 unverdicted novelty 4.0

VideoLLaMA 2 improves video LLMs via a new STC connector for spatial-temporal dynamics and joint audio training, reaching competitive results on video QA and captioning benchmarks.

Reference graph

Works this paper leans on

63 extracted references · 63 canonical work pages · cited by 22 Pith papers · 21 internal anchors

[1]

GPT-4 Technical Report

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023. 1, 6, 3

work page internal anchor Pith review Pith/arXiv arXiv 2023
[2]

Claude 3

Anthropic. Claude 3. https://www.anthropic.com/ news/claude-3-family, 2024. 7, 2

work page 2024
[3]

Minigpt4-video: Advancing multimodal llms for video understanding with interleaved visual-textual tokens

Kirolos Ataallah, Xiaoqian Shen, Eslam Abdelrahman, Es- sam Sleiman, Deyao Zhu, Jian Ding, and Mohamed Elho- seiny. Minigpt4-video: Advancing multimodal llms for video understanding with interleaved visual-textual tokens. arXiv preprint arXiv:2404.03413, 2024. 6, 7, 2

work page arXiv 2024
[4]

Qwen Technical Report

Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xi- aodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. Qwen technical report. arXiv preprint arXiv:2309.16609,

work page internal anchor Pith review Pith/arXiv arXiv
[5]

Frozen in time: A joint video and image encoder for end-to- end retrieval

Max Bain, Arsha Nagrani, Gül Varol, and Andrew Zisserman. Frozen in time: A joint video and image encoder for end-to- end retrieval. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1728–1738, 2021. 4, 3

work page 2021
[6]

Language models are few-shot learners.Advances in neural information processing systems, 33:1877–1901, 2020

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Sub- biah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners.Advances in neural information processing systems, 33:1877–1901, 2020. 1

work page 1901
[7]

Sharegpt4video: Improving video understanding and generation with better captions, 2024

Lin Chen, Xilin Wei, Jinsong Li, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Zehui Chen, Haodong Duan, Bin Lin, Zhenyu Tang, et al. Sharegpt4video: Improving video understand- ing and generation with better captions. arXiv preprint arXiv:2406.04325, 2024. 7, 2

work page arXiv 2024
[8]

How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites

Zhe Chen, Weiyun Wang, Hao Tian, Shenglong Ye, Zhangwei Gao, Erfei Cui, Wenwen Tong, Kongzhi Hu, Jiapeng Luo, Zheng Ma, et al. How far are we to gpt-4v? closing the gap to commercial multimodal models with open-source suites. arXiv preprint arXiv:2404.16821, 2024. 2, 6, 7

work page internal anchor Pith review Pith/arXiv arXiv 2024
[9]

VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs

Zesen Cheng, Sicong Leng, Hang Zhang, Yifei Xin, Xin Li, Guanzheng Chen, Yongxin Zhu, Wenqi Zhang, Ziyang Luo, Deli Zhao, et al. Videollama 2: Advancing spatial- temporal modeling and audio understanding in video-llms. arXiv preprint arXiv:2406.07476, 2024. 6, 7, 2

work page internal anchor Pith review Pith/arXiv arXiv 2024
[10]

Minedojo: Building open-ended embodied agents with internet-scale knowledge

Linxi Fan, Guanzhi Wang, Yunfan Jiang, Ajay Mandlekar, Yuncong Yang, Haoyi Zhu, Andrew Tang, De-An Huang, Yuke Zhu, and Anima Anandkumar. Minedojo: Building open-ended embodied agents with internet-scale knowledge. Advances in Neural Information Processing Systems, 35: 18343–18362, 2022. 1

work page 2022
[11]

Video-ccam: Enhancing video-language un- derstanding with causal cross-attention masks for short and long videos.arXiv preprint arXiv:2408.14023, 2024

Jiajun Fei, Dian Li, Zhidong Deng, Zekun Wang, Gang Liu, and Hui Wang. Video-ccam: Enhancing video-language un- derstanding with causal cross-attention masks for short and long videos. arXiv preprint arXiv:2408.14023, 2024. 2, 6, 7

work page arXiv 2024
[12]

MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models

Chaoyou Fu, Peixian Chen, Yunhang Shen, Yulei Qin, Meng- dan Zhang, Xu Lin, Jinrui Yang, Xiawu Zheng, Ke Li, Xing Sun, Yunsheng Wu, and Rongrong Ji. Mme: A comprehen- sive evaluation benchmark for multimodal large language models. arXiv preprint arXiv:2306.13394, 2023. 1, 3

work page internal anchor Pith review Pith/arXiv arXiv 2023
[13]

Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis

Chaoyou Fu, Yuhan Dai, Yondong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, Chenyu Zhou, Yunhang Shen, Mengdan Zhang, et al. Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis. arXiv preprint arXiv:2405.21075, 2024. 1, 2

work page internal anchor Pith review Pith/arXiv arXiv 2024
[14]

Data engineering for scaling language models to 128K context.arXiv preprint arXiv:2402.10171, 2024

Yao Fu, Rameswar Panda, Xinyao Niu, Xiang Yue, Hannaneh Hajishirzi, Yoon Kim, and Hao Peng. Data engineering for scaling language models to 128k context. arXiv preprint arXiv:2402.10171, 2024. 6

work page arXiv 2024
[15]

Ego4d: Around the world in 3,000 hours of egocentric video

Kristen Grauman, Andrew Westbury, Eugene Byrne, Zachary Chavis, Antonino Furnari, Rohit Girdhar, Jackson Hamburger, Hao Jiang, Miao Liu, Xingyu Liu, et al. Ego4d: Around the world in 3,000 hours of egocentric video. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18995–19012, 2022. 1, 3, 4

work page 2022
[16]

Ma-lmm: Memory-augmented large multimodal model for long-term video understanding.arXiv preprint arXiv:2404.05726, 2024

Bo He, Hengduo Li, Young Kyun Jang, Menglin Jia, Xue- fei Cao, Ashish Shah, Abhinav Shrivastava, and Ser-Nam Lim. Ma-lmm: Memory-augmented large multimodal model for long-term video understanding. arXiv preprint arXiv:2404.05726, 2024. 3, 7, 2

work page arXiv 2024
[17]

Movienet: A holistic dataset for movie understanding

Qingqiu Huang, Yu Xiong, Anyi Rao, Jiaze Wang, and Dahua Lin. Movienet: A holistic dataset for movie understanding. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part IV 16, pages 709–727. Springer, 2020. 2, 3

work page 2020
[18]

The Kinetics Human Action Video Dataset

Will Kay, Joao Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, Sudheendra Vijayanarasimhan, Fabio Viola, Tim Green, Trevor Back, Paul Natsev, et al. The kinetics hu- man action video dataset. arXiv preprint arXiv:1705.06950,

work page internal anchor Pith review Pith/arXiv arXiv
[19]

Complex video rea- soning and robustness evaluation suite for video-lmms.arXiv preprint arXiv:2405.03690, 2024

Muhammad Uzair Khattak, Muhammad Ferjad Naeem, Jameel Hassan, Muzammal Naseer, Federico Tombari, Fa- had Shahbaz Khan, and Salman Khan. Complex video rea- soning and robustness evaluation suite for video-lmms. arXiv preprint arXiv:2405.03690, 2024. 1

work page arXiv 2024
[20]

Dense-captioning events in videos

Ranjay Krishna, Kenji Hata, Frederic Ren, Li Fei-Fei, and Juan Carlos Niebles. Dense-captioning events in videos. In Proceedings of the IEEE international conference on computer vision, pages 706–715, 2017. 6

work page 2017
[21]

TVQA: Localized, Compositional Video Question Answering

Jie Lei, Licheng Yu, Mohit Bansal, and Tamara L Berg. Tvqa: Localized, compositional video question answering. arXiv preprint arXiv:1809.01696, 2018. 2, 4

work page internal anchor Pith review Pith/arXiv arXiv 2018
[22]

SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension

Bohao Li, Rui Wang, Guangzhi Wang, Yuying Ge, Yix- iao Ge, and Ying Shan. Seed-bench: Benchmarking mul- timodal llms with generative comprehension. arXiv preprint arXiv:2307.16125, 2023. 1

work page internal anchor Pith review Pith/arXiv arXiv 2023
[23]

Otter: A Multi-Modal Model with In-Context Instruction Tuning

Bo Li, Yuanhan Zhang, Liangyu Chen, Jinghao Wang, Jingkang Yang, and Ziwei Liu. Otter: A multi-modal model 9 with in-context instruction tuning. CoRR, abs/2305.03726,

work page internal anchor Pith review arXiv
[24]

LLaVA-OneVision: Easy Visual Task Transfer

Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Yanwei Li, Ziwei Liu, and Chun- yuan Li. Llava-onevision: Easy visual task transfer. arXiv preprint arXiv:2408.03326, 2024. 6, 7

work page internal anchor Pith review Pith/arXiv arXiv 2024
[25]

VideoChat: Chat-Centric Video Understanding

KunChang Li, Yinan He, Yi Wang, Yizhuo Li, Wenhai Wang, Ping Luo, Yali Wang, Limin Wang, and Yu Qiao. Videochat: Chat-centric video understanding. arXiv preprint arXiv:2305.06355, 2023. 2, 7

work page internal anchor Pith review Pith/arXiv arXiv 2023
[26]

Mvbench: A comprehensive multi-modal video understand- ing benchmark.arXiv preprint arXiv:2311.17005, 2023

Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, Yi Liu, Zun Wang, Jilan Xu, Guo Chen, Ping Luo, et al. Mvbench: A comprehensive multi-modal video understand- ing benchmark. arXiv preprint arXiv:2311.17005, 2023. 1, 2, 3, 7

work page arXiv 2023
[27]

Llama-vid: An image is worth 2 tokens in large language models.arXiv preprint arXiv:2311.17043, 2023

Yanwei Li, Chengyao Wang, and Jiaya Jia. Llama-vid: An image is worth 2 tokens in large language models. arXiv preprint arXiv:2311.17043, 2023. 1, 2, 3, 7

work page arXiv 2023
[28]

Video-LLaVA: Learning United Visual Representation by Alignment Before Projection

Bin Lin, Bin Zhu, Yang Ye, Munan Ning, Peng Jin, and Li Yuan. Video-llava: Learning united visual represen- tation by alignment before projection. arXiv preprint arXiv:2311.10122, 2023. 2, 7

work page internal anchor Pith review Pith/arXiv arXiv 2023
[29]

Visual instruction tuning

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. Advances in neural information processing systems, 36, 2023. 2, 7

work page 2023
[30]

World model on million-length video and language with blockwise ringattention

Hao Liu, Wilson Yan, Matei Zaharia, and Pieter Abbeel. World model on million-length video and language with ringattention. arXiv preprint arXiv:2402.08268, 2024. 3

work page arXiv 2024
[31]

Lost in the middle: How language models use long con- texts

Nelson F Liu, Kevin Lin, John Hewitt, Ashwin Paran- jape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. Lost in the middle: How language models use long con- texts. Transactions of the Association for Computational Linguistics, 12:157–173, 2024. 4

work page 2024
[32]

MMBench: Is Your Multi-modal Model an All-around Player?

Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, et al. Mmbench: Is your multi-modal model an all-around player? arXiv preprint arXiv:2307.06281, 2023. 1

work page internal anchor Pith review Pith/arXiv arXiv 2023
[33]

Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models

Muhammad Maaz, Hanoona Rasheed, Salman Khan, and Fahad Shahbaz Khan. Video-chatgpt: Towards detailed video understanding via large vision and language models. arXiv preprint arXiv:2306.05424, 2023. 2

work page internal anchor Pith review arXiv 2023
[34]

Egoschema: A diagnostic benchmark for very long- form video language understanding

Karttikeya Mangalam, Raiymbek Akshulakov, and Jitendra Malik. Egoschema: A diagnostic benchmark for very long- form video language understanding. Advances in Neural Information Processing Systems, 36, 2023. 1, 2, 3

work page 2023
[35]

Howto100m: Learning a text-video embedding by watching hundred million narrated video clips

Antoine Miech, Dimitri Zhukov, Jean-Baptiste Alayrac, Makarand Tapaswi, Ivan Laptev, and Josef Sivic. Howto100m: Learning a text-video embedding by watching hundred million narrated video clips. In Proceedings of the IEEE/CVF international conference on computer vision, pages 2630–2640, 2019. 3

[36]

Video-bench: A comprehensive benchmark and toolkit for evaluating video-based large language models

Munan Ning, Bin Zhu, Yujia Xie, Bin Lin, Jiaxi Cui, Lu Yuan, Dongdong Chen, and Li Yuan. Video-bench: A comprehensive benchmark and toolkit for evaluating video-based large language models. arXiv preprint arXiv:2311.16103, 2023. 1

[37]

OpenAI. Gpt-4o. https://openai.com/index/hello-gpt-4o/, 2024. 6, 7, 2

[38]

Retrieving-to-answer: Zero-shot video question answering with frozen large language models

Junting Pan, Ziyi Lin, Yuying Ge, Xiatian Zhu, Renrui Zhang, Yi Wang, Yu Qiao, and Hongsheng Li. Retrieving-to-answer: Zero-shot video question answering with frozen large language models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 272–283, 2023. 3

[39]

Timechat: A time-sensitive multimodal large language model for long video understanding

Shuhuai Ren, Linli Yao, Shicheng Li, Xu Sun, and Lu Hou. Timechat: A time-sensitive multimodal large language model for long video understanding. arXiv preprint arXiv:2312.02051, 2023. 2

[40]

Video-xl: Extra-long vision language model for hour-scale video understanding

Yan Shu, Peitian Zhang, Zheng Liu, Minghao Qin, Junjie Zhou, Tiejun Huang, and Bo Zhao. Video-xl: Extra-long vision language model for hour-scale video understanding. arXiv preprint arXiv:2409.14485, 2024. 2, 3, 6, 7

[41]

Moviechat: From dense token to sparse memory for long video understanding

Enxin Song, Wenhao Chai, Guanhong Wang, Yucheng Zhang, Haoyang Zhou, Feiyang Wu, Xun Guo, Tian Ye, Yan Lu, Jenq-Neng Hwang, et al. Moviechat: From dense token to sparse memory for long video understanding. arXiv preprint arXiv:2307.16449, 2023. 1, 2, 3, 7, 4

[42]

Moviellm: Enhancing long video understanding with ai-generated movies

Zhende Song, Chenchen Wang, Jiamu Sheng, Chi Zhang, Gang Yu, Jiayuan Fan, and Tao Chen. Moviellm: Enhancing long video understanding with ai-generated movies. arXiv preprint arXiv:2403.01422, 2024. 7, 2

[43]

Real-world anomaly detection in surveillance videos

Waqas Sultani, Chen Chen, and Mubarak Shah. Real-world anomaly detection in surveillance videos. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 6479–6488, 2018. 4, 1, 2

[44]

Gemini: A Family of Highly Capable Multimodal Models

Gemini Team, Rohan Anil, Sebastian Borgeaud, Yonghui Wu, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, et al. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023. 1

[45]

Llama 2: Open Foundation and Fine-Tuned Chat Models

Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023. 1

[46]

Videoagent: Long-form video understanding with large language model as agent

Xiaohan Wang, Yuhui Zhang, Orr Zohar, and Serena Yeung-Levy. Videoagent: Long-form video understanding with large language model as agent. arXiv preprint arXiv:2403.10517, 2024.

[47]

Paxion: Patching action knowledge in video-language foundation models

Zhenhailong Wang, Ansel Blume, Sha Li, Genglin Liu, Jaemin Cho, Zineng Tang, Mohit Bansal, and Heng Ji. Paxion: Patching action knowledge in video-language foundation models. Advances in Neural Information Processing Systems, 36, 2023. 3

[48]

Star: A benchmark for situated reasoning in real-world videos

Bo Wu, Shoubin Yu, Zhenfang Chen, Joshua B Tenenbaum, and Chuang Gan. Star: A benchmark for situated reasoning in real-world videos. In Thirty-fifth conference on neural information processing systems datasets and benchmarks track (Round 2), 2021. 3

[49]

Longvideobench: A benchmark for long-context interleaved video-language understanding

Haoning Wu, Dongxu Li, Bei Chen, and Junnan Li. Longvideobench: A benchmark for long-context interleaved video-language understanding. arXiv preprint arXiv:2407.15754, 2024. 2

[50]

Next-qa: Next phase of question-answering to explaining temporal actions

Junbin Xiao, Xindi Shang, Angela Yao, and Tat-Seng Chua. Next-qa: Next phase of question-answering to explaining temporal actions. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9777–9786, 2021. 2, 3

[51]

Funqa: Towards surprising video comprehension

Binzhu Xie, Sicheng Zhang, Zitang Zhou, Bo Li, Yuanhan Zhang, Jack Hessel, Jingkang Yang, and Ziwei Liu. Funqa: Towards surprising video comprehension. arXiv preprint arXiv:2306.14899, 2023. 3

[52]

Msr-vtt: A large video description dataset for bridging video and language

Jun Xu, Tao Mei, Ting Yao, and Yong Rui. Msr-vtt: A large video description dataset for bridging video and language. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5288–5296, 2016. 1, 2, 3

[53]

Retrieval-based video language model for efficient long video question answering

Jiaqi Xu, Cuiling Lan, Wenxuan Xie, Xuejin Chen, and Yan Lu. Retrieval-based video language model for efficient long video question answering. arXiv preprint arXiv:2312.04931,

[54]

mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality

Qinghao Ye, Haiyang Xu, Guohai Xu, Jiabo Ye, Ming Yan, Yiyang Zhou, Junyang Wang, Anwen Hu, Pengcheng Shi, Yaya Shi, et al. mplug-owl: Modularization empowers large language models with multimodality. arXiv preprint arXiv:2304.14178, 2023. 2, 7

[55]

Clevrer: Collision events for video representation and reasoning

Kexin Yi, Chuang Gan, Yunzhu Li, Pushmeet Kohli, Jiajun Wu, Antonio Torralba, and Joshua B Tenenbaum. Clevrer: Collision events for video representation and reasoning. In International Conference on Learning Representations, 2019. 4

[56]

Activitynet-qa: A dataset for understanding complex web videos via question answering

Zhou Yu, Dejing Xu, Jun Yu, Ting Yu, Zhou Zhao, Yueting Zhuang, and Dacheng Tao. Activitynet-qa: A dataset for understanding complex web videos via question answering. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 9127–9134, 2019. 3

[57]

MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI

Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, et al. Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi. arXiv preprint arXiv:2311.16502, 2023. 1, 3, 8

[58]

Movie101: A new movie understanding benchmark

Zihao Yue, Qi Zhang, Anwen Hu, Liang Zhang, Ziheng Wang, and Qin Jin. Movie101: A new movie understanding benchmark. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 4669–4684, 2023. 2, 6, 1

[59]

Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding

Hang Zhang, Xin Li, and Lidong Bing. Video-llama: An instruction-tuned audio-visual language model for video understanding. arXiv preprint arXiv:2306.02858, 2023. 2, 7

[60]

Long Context Transfer from Language to Vision

Peiyuan Zhang, Kaichen Zhang, Bo Li, Guangtao Zeng, Jingkang Yang, Yuanhan Zhang, Ziyue Wang, Haoran Tan, Chunyuan Li, and Ziwei Liu. Long context transfer from language to vision. arXiv preprint arXiv:2406.16852, 2024. 1, 2, 3, 6, 7

[61]

∞ bench: Extending long context evaluation beyond 100k tokens

Xinrong Zhang, Yingfa Chen, Shengding Hu, Zihang Xu, Junhao Chen, Moo Khai Hao, Xu Han, Zhen Leng Thai, Shuo Wang, Zhiyuan Liu, et al. ∞ bench: Extending long context evaluation beyond 100k tokens. arXiv preprint arXiv:2402.13718, 2024. 6

[62]

Languagebind: Extending video-language pretraining to n-modality by language-based semantic alignment

Bin Zhu, Bin Lin, Munan Ning, Yang Yan, Jiaxi Cui, HongFa Wang, Yatian Pang, Wenhao Jiang, Junwu Zhang, Zongwei Li, et al. Languagebind: Extending video-language pretraining to n-modality by language-based semantic alignment. arXiv preprint arXiv:2310.01852, 2023. 6

[63]

Minigpt-4: Enhancing vision-language understanding with advanced large language models

Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. Minigpt-4: Enhancing vision-language understanding with advanced large language models. In The Twelfth International Conference on Learning Representations, 2023. 2

MLVU: Benchmarking Multi-task Long Video Understanding

Supplementary Material

A. Overview of Appendix

• B: Evaluation...
