MLVU: Benchmarking Multi-task Long Video Understanding
Pith reviewed 2026-05-14 19:49 UTC · model grok-4.3
The pith
MLVU benchmark shows current multimodal models struggle with most long video tasks and degrade sharply on longer clips.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
MLVU addresses prior benchmark limits by allowing flexible video lengths, covering diverse genres, and offering varied evaluation tasks; the study of 23 MLLMs shows every existing method struggles on most tasks and suffers severe performance degradation on longer videos, while also indicating that context length, image-understanding ability, and LLM backbone choice matter for progress.
What carries the argument
The MLVU benchmark itself, which supplies extended videos, genre variety, and diversified tasks to measure MLLMs' long-video understanding abilities.
If this is right
- Improvements in context length handling would directly raise scores on longer videos.
- Stronger image-understanding components and better LLM backbones would lift overall long-video performance.
- Models must be tested across multiple genres to confirm they generalize beyond narrow cases.
- Future work can use MLVU scores to track whether new techniques close the observed gaps.
Where Pith is reading between the lines
- Success on MLVU could transfer to practical uses such as analyzing surveillance or summarizing extended recordings.
- The benchmark may highlight whether scaling context windows alone solves the length degradation or if new architectures are required.
- Adding even longer videos or additional task types in follow-ups would likely expose further limits in current designs.
Load-bearing premise
The chosen video lengths, genres, and tasks sufficiently represent the main challenges of real-world long video understanding.
What would settle it
A new model that maintains high accuracy on all MLVU tasks without measurable drop when video length increases would directly challenge the reported performance degradation.
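One way to check for such a drop is to bucket per-question results by video duration and compare accuracy across buckets. The following is a minimal sketch under stated assumptions: the record format `(duration_minutes, is_correct)`, the bucket edges, and both function names are hypothetical and are not taken from the paper or from any MLVU tooling.

```python
from collections import defaultdict


def accuracy_by_duration(records, bucket_edges_min):
    """Return accuracy per duration bucket.

    records: iterable of (duration_minutes, is_correct) pairs (hypothetical format).
    bucket_edges_min: ascending edges in minutes; bucket k holds durations that
    are >= exactly k of the edges, so the last bucket is open-ended.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for duration_min, is_correct in records:
        # Count how many edges the duration has reached to get its bucket index.
        bucket = sum(duration_min >= edge for edge in bucket_edges_min)
        totals[bucket] += 1
        hits[bucket] += int(is_correct)
    return {b: hits[b] / totals[b] for b in sorted(totals)}


def max_drop(acc_by_bucket):
    """Largest accuracy loss relative to the shortest-duration bucket."""
    buckets = sorted(acc_by_bucket)
    base = acc_by_bucket[buckets[0]]
    return max(base - acc_by_bucket[b] for b in buckets)
```

A model that would challenge the reported degradation is one for which `max_drop` stays near zero across all buckets while every per-bucket accuracy remains high.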
Original abstract
The evaluation of Long Video Understanding (LVU) performance poses an important but challenging research problem. Despite previous efforts, the existing video understanding benchmarks are severely constrained by several issues, especially the insufficient lengths of videos, a lack of diversity in video types and evaluation tasks, and the inappropriateness for evaluating LVU performances. To address the above problems, we propose a new benchmark called MLVU (Multi-task Long Video Understanding Benchmark) for the comprehensive and in-depth evaluation of LVU. MLVU presents the following critical values: 1) The substantial and flexible extension of video lengths, which enables the benchmark to evaluate LVU performance across a wide range of durations. 2) The inclusion of various video genres, e.g., movies, surveillance footage, egocentric videos, cartoons, game videos, etc., which reflects the models' LVU performances in different scenarios. 3) The development of diversified evaluation tasks, which enables a comprehensive examination of MLLMs' key abilities in long-video understanding. The empirical study with 23 latest MLLMs reveals significant room for improvement in today's technique, as all existing methods struggle with most of the evaluation tasks and exhibit severe performance degradation when handling longer videos. Additionally, it suggests that factors such as context length, image-understanding ability, and the choice of LLM backbone can play critical roles in future advancements. We anticipate that MLVU will advance the research of long video understanding by providing a comprehensive and in-depth analysis of MLLMs.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces the MLVU benchmark for multi-task long video understanding, extending video lengths substantially, incorporating diverse genres (movies, surveillance, egocentric, cartoons, games), and defining multiple evaluation tasks to probe MLLM capabilities. It reports results from 23 recent MLLMs showing that all models struggle on most tasks and exhibit severe performance degradation as video duration increases, while suggesting that context length, image-understanding ability, and LLM backbone choice are key factors for future progress.
Significance. If the observed degradation can be shown to isolate long-range temporal reasoning rather than input-length artifacts, MLVU would supply a useful diagnostic benchmark that highlights concrete limitations in current MLLMs and could steer targeted improvements in context handling and temporal modeling.
major comments (2)
- [Abstract and §4] Abstract and §4 (empirical study): the central claim that models exhibit 'severe performance degradation when handling longer videos' is not supported by any description of frame sampling rates, visual-token budgets, or context-window management. Without holding the number of frames per minute and total visual tokens constant while varying only duration, the reported drop cannot be unambiguously attributed to LVU comprehension failures rather than input constraints.
- [§3] §3 (benchmark construction): task definitions, metric formulations, statistical significance controls, and video-exclusion criteria are not detailed. These omissions leave the cross-model and cross-duration comparisons only moderately supported and make it difficult to assess whether the chosen tasks genuinely probe long-range reasoning.
minor comments (1)
- [Abstract] Abstract: the phrase 'the inappropriateness for evaluating LVU performances' of prior benchmarks is stated without enumerating the specific shortcomings (e.g., length caps, task coverage) that MLVU is designed to remedy.
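The input-side control raised in the first major comment can be illustrated with a short sketch. It is not the paper's preprocessing pipeline: the function names and the parameters `frames_per_minute`, `tokens_per_frame`, and `token_budget` are hypothetical. The sketch only shows the arithmetic of the two controls: with a fixed sampling rate the visual-token count grows linearly with duration, while with a fixed token budget the effective sampling rate falls as the video gets longer.

```python
def sample_timestamps(duration_s, frames_per_minute):
    """Uniformly spaced frame timestamps (seconds) at a fixed sampling rate."""
    n_frames = max(1, round(duration_s / 60 * frames_per_minute))
    step = duration_s / n_frames
    # Use the midpoint of each equal-length segment of the video.
    return [step * (i + 0.5) for i in range(n_frames)]


def visual_tokens(duration_s, frames_per_minute, tokens_per_frame):
    """Total visual tokens when every sampled frame costs tokens_per_frame."""
    return len(sample_timestamps(duration_s, frames_per_minute)) * tokens_per_frame


def frames_per_minute_under_budget(duration_s, token_budget, tokens_per_frame):
    """Effective sampling rate when the total visual-token count is capped."""
    n_frames = max(1, token_budget // tokens_per_frame)
    return n_frames / (duration_s / 60)
```

For instance, with 64 tokens per frame (an assumed value), a 2-minute and a 20-minute video sampled at 4 frames per minute cost 512 and 5,120 tokens respectively, whereas a fixed 1,024-token budget yields 8 frames per minute for the former and 0.8 for the latter. Unless tokens per frame are also varied, the two quantities cannot both be held constant across durations, so an evaluation has to state which one it fixes.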
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive feedback. The comments highlight important areas where additional methodological transparency is needed to strengthen the claims about performance degradation and the validity of the benchmark tasks. We have revised the manuscript to address both major comments by expanding the relevant sections with the requested details on input processing and task construction. Point-by-point responses follow.
- Referee: [Abstract and §4] Abstract and §4 (empirical study): the central claim that models exhibit 'severe performance degradation when handling longer videos' is not supported by any description of frame sampling rates, visual-token budgets, or context-window management. Without holding the number of frames per minute and total visual tokens constant while varying only duration, the reported drop cannot be unambiguously attributed to LVU comprehension failures rather than input constraints.
  Authors: We agree that the original manuscript did not sufficiently document the input preprocessing pipeline, which limits the ability to isolate duration effects. In the revised version we have added a dedicated subsection in §4 describing the uniform frame-sampling strategy (fixed frames per minute across all durations), the per-model visual-token budget constraints, and the context-window truncation policy. With these controls held constant, the degradation trend remains statistically significant; we have included an additional figure and table that replot results under fixed token budgets to make this explicit. These changes directly address the concern and allow readers to attribute the drop more confidently to long-range reasoning limitations.
  revision: yes
- Referee: [§3] §3 (benchmark construction): task definitions, metric formulations, statistical significance controls, and video-exclusion criteria are not detailed. These omissions leave the cross-model and cross-duration comparisons only moderately supported and make it difficult to assess whether the chosen tasks genuinely probe long-range reasoning.
  Authors: We acknowledge that §3 was too concise on these points. The revised manuscript now contains expanded subsections that (1) provide formal definitions and input-output formats for each task, (2) specify the exact metrics (accuracy, mean average precision, or normalized edit distance as appropriate), (3) describe the bootstrapping procedure used for statistical significance, and (4) list the explicit exclusion criteria applied during video curation (e.g., minimum duration, genre balance, and quality filters). These additions make the benchmark construction reproducible and clarify how each task targets long-range temporal dependencies rather than short-term cues.
  revision: yes
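The rebuttal mentions a bootstrapping procedure without describing it. As a rough illustration of what such a check could look like, the sketch below computes a percentile bootstrap confidence interval for the accuracy gap between short and long videos. The function name, the choice of a percentile interval, the resample count, and the input format (lists of 0/1 correctness flags) are all assumptions made for this example, not the authors' method.

```python
import random


def bootstrap_drop_ci(short_correct, long_correct, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for accuracy(short) - accuracy(long).

    short_correct, long_correct: lists of 0/1 flags, one per question
    (hypothetical input format). Returns (point_estimate, ci_low, ci_high).
    """
    rng = random.Random(seed)

    def accuracy(flags):
        return sum(flags) / len(flags)

    drops = []
    for _ in range(n_resamples):
        # Resample each group independently, with replacement.
        s = rng.choices(short_correct, k=len(short_correct))
        l = rng.choices(long_correct, k=len(long_correct))
        drops.append(accuracy(s) - accuracy(l))
    drops.sort()
    ci_low = drops[int(alpha / 2 * n_resamples)]
    ci_high = drops[int((1 - alpha / 2) * n_resamples) - 1]
    return accuracy(short_correct) - accuracy(long_correct), ci_low, ci_high
```

Under this sketch, a drop would be treated as significant at the chosen level when the interval excludes zero. Resampling questions independently ignores any clustering of questions within the same video; a design that resamples whole videos may be more appropriate if several questions share one clip.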
Circularity Check
No circularity: benchmark proposal with direct external model evaluations
The paper introduces the MLVU benchmark and reports direct empirical results from evaluating 23 third-party MLLMs on it. No derivations, equations, fitted parameters, or self-referential claims exist that reduce any reported result to a quantity defined by the authors' own inputs or prior work. The central claims rest on external model performance measurements rather than any internal construction or self-citation chain.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: existing video understanding benchmarks are severely constrained by insufficient lengths, lack of diversity in video types and tasks, and inappropriateness for LVU evaluation.
Forward citations
Cited by 22 Pith papers
- CoRDS: Coreset-based Representative and Diverse Selection for Streaming Video Understanding
  CoRDS selects a compact KV-cache subset via joint-space coreset coverage and log-det diversity to outperform token-wise heuristics on long-video VLM benchmarks.
- SYNCR: A Cross-Video Reasoning Benchmark with Synthetic Grounding
  SYNCR benchmark shows leading MLLMs reach only 52.5% average accuracy on cross-video reasoning tasks against an 89.5% human baseline, with major weaknesses in physical and spatial reasoning.
- StoryTR: Narrative-Centric Video Temporal Retrieval with Theory of Mind Reasoning
  StoryTR is a new benchmark and agentic data pipeline that adds explicit Theory of Mind reasoning chains to train smaller video retrieval models, yielding a 15% relative IoU gain over larger baselines on narrative content.
- Mosaic: Cross-Modal Clustering for Efficient Video Understanding
  Mosaic uses cross-modal clusters as the unit for KVCache organization in VLMs to achieve up to 1.38x speedup in streaming long-video understanding.
- AdaSpark: Adaptive Sparsity for Efficient Long-Video Understanding
  AdaSpark delivers up to 57% FLOP reduction in Video-LLMs for long videos through adaptive cube- and token-level sparsity without apparent loss in performance on hour-scale benchmarks.
- SVAgent: Storyline-Guided Long Video Understanding via Cross-Modal Multi-Agent Collaboration
  SVAgent improves long video question answering by constructing storylines via multi-agent collaboration and aligning cross-modal predictions for more robust, human-like reasoning.
- Unified Reward Model for Multimodal Understanding and Generation
  UnifiedReward is the first unified reward model that jointly assesses multimodal understanding and generation to provide better preference signals for aligning vision models via DPO.
- Video-MMMU: Evaluating Knowledge Acquisition from Multi-Discipline Professional Videos
  Video-MMMU benchmark shows large multimodal models exhibit steep performance drops on higher cognitive tasks when learning from professional videos and lag significantly behind humans in knowledge acquisition.
- Video Active Perception: Effective Inference-Time Long-Form Video Understanding with Vision-Language Models
  VAP is a training-free active-perception method that improves zero-shot long-form video QA performance and frame efficiency up to 5.6x in VLMs by selecting keyframes that differ from priors generated by a text-conditi...
- One Token per Highly Selective Frame: Towards Extreme Compression for Long Video Understanding
  XComp reaches extreme video compression (one token per selective frame) via learnable progressive token compression and question-conditioned frame selection, lifting LVBench accuracy from 42.9 percent to 46.2 percent ...
- ForestPrune: High-ratio Visual Token Compression for Video Multimodal Large Language Models via Spatial-Temporal Forest Modeling
  ForestPrune prunes 90% of visual tokens in video MLLMs like LLaVA-OneVision while retaining 95.8% accuracy by modeling tokens as spatial-temporal forests and scoring importance via tree depth and node roles.
- InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency
  InternVL3.5 advances open-source multimodal models with Cascade RL for +16% reasoning gains and ViR for 4x inference speedup, with the 241B model reaching SOTA among open-source MLLMs on multimodal, reasoning, and age...
- InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models
  InternVL3-78B sets a new open-source SOTA of 72.2 on MMMU via native joint multimodal pre-training, V2PE, MPO, and test-time scaling while remaining competitive with proprietary models.
- SmolVLM: Redefining small and efficient multimodal models
  SmolVLM-256M outperforms a 300-times larger model using under 1 GB GPU memory, while the 2.2B version matches state-of-the-art VLMs at half the memory cost.
- LLaVA-Video: Video Instruction Tuning With Synthetic Data
  LLaVA-Video-178K is a new synthetic video instruction dataset that, when combined with existing data to train LLaVA-Video, produces strong results on video understanding benchmarks.
- What Limits Vision-and-Language Navigation?
  StereoNav reaches new benchmark highs on R2R-CE and RxR-CE and improves real-robot reliability by supplying persistent target-location priors and stereo-derived geometry that stay stable under lighting changes and blur.
- TTF: Temporal Token Fusion for Efficient Video-Language Model
  TTF fuses temporally redundant visual tokens via local similarity search in a plug-and-play way, cutting ~67% tokens on Qwen3-VL-8B while retaining 99.5% accuracy with minimal overhead.
- LLaVA-OneVision: Easy Visual Task Transfer
  LLaVA-OneVision is the first single open LMM to simultaneously achieve strong performance in single-image, multi-image, and video scenarios with cross-scenario transfer capabilities.
- EasyVideoR1: Easier RL for Video Understanding
  EasyVideoR1 delivers an optimized RL pipeline for video understanding in large vision-language models, achieving 1.47x throughput gains and aligned results on 22 benchmarks.
- Seed1.5-VL Technical Report
  Seed1.5-VL is a compact multimodal model that sets new records on dozens of vision-language benchmarks and outperforms prior systems on agent-style tasks.
- VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding
  VideoLLaMA3 uses a vision-centric training paradigm and token-reduction design to reach competitive results on image and video benchmarks.
- VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs
  VideoLLaMA 2 improves video LLMs via a new STC connector for spatial-temporal dynamics and joint audio training, reaching competitive results on video QA and captioning benchmarks.