pith. machine review for the scientific record.

arxiv: 2311.17005 · v4 · pith:JO2LRJNGnew · submitted 2023-11-28 · 💻 cs.CV

MVBench: A Comprehensive Multi-modal Video Understanding Benchmark

Pith reviewed 2026-05-17 20:18 UTC · model grok-4.3

classification 💻 cs.CV
keywords multi-modal large language models · video understanding benchmark · temporal reasoning · MLLM evaluation · VideoChat2 · static-to-dynamic conversion · multiple-choice QA

The pith

Most multi-modal AI models struggle with temporal understanding in videos. A new benchmark exposes the gap, and the accompanying VideoChat2 baseline surpasses leading models on it by more than 15 percent.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper creates MVBench to test multi-modal large language models (MLLMs) on twenty video tasks that hinge on changes over time and cannot be effectively solved from any single frame. It defines these tasks by converting established static image problems into dynamic video versions, then automatically turns public video annotations into multiple-choice questions for evaluation. Results show that current leading models perform poorly on these temporal skills, while the authors' VideoChat2 baseline, built through progressive multi-modal training, outperforms them by more than 15 percent. Because the pipeline reuses existing ground-truth labels instead of relying on LLM scoring, the benchmark can be built quickly and scored with little bias.

Core claim

MVBench covers twenty challenging video tasks that cannot be effectively solved with a single frame. These tasks are generated through a static-to-dynamic conversion that systematically produces examples requiring temporal skills from basic perception to higher cognition. Existing MLLMs remain far from satisfactory in temporal understanding, while VideoChat2 surpasses leading models by over fifteen percent on the benchmark.

What carries the argument

The static-to-dynamic method that transforms static image tasks into dynamic video tasks to generate a broad range of temporal skills from perception to cognition, paired with automatic conversion of public annotations into multiple-choice QA pairs.
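
The annotation-to-question step can be illustrated with a minimal sketch. It is not the authors' pipeline: the names `MultipleChoiceItem`, `annotation_to_mcq`, and `score`, and the idea of sampling distractors from the other labels of the same task, are assumptions made here for illustration only.

```python
import random
from dataclasses import dataclass
from typing import List


@dataclass
class MultipleChoiceItem:
    question: str
    options: List[str]
    answer_index: int  # position of the ground-truth label in `options`


def annotation_to_mcq(question, true_label, label_pool, num_options=4, seed=0):
    """Build one multiple-choice item from a ground-truth annotation.

    Hypothetical helper: distractors are sampled from the other labels in
    `label_pool`, and the options are shuffled so the position of the
    correct answer carries no signal.
    """
    rng = random.Random(seed)
    distractor_pool = sorted(set(label_pool) - {true_label})
    if len(distractor_pool) < num_options - 1:
        raise ValueError("not enough distinct labels to build distractors")
    options = rng.sample(distractor_pool, num_options - 1) + [true_label]
    rng.shuffle(options)
    return MultipleChoiceItem(question, options, options.index(true_label))


def score(item, predicted_index):
    """Exact-match scoring against the annotation; no LLM judge is involved."""
    return predicted_index == item.answer_index
```

Scoring by exact match against the stored `answer_index` is what ties evaluation to the original annotation rather than to a judge model's opinion.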

If this is right

  • Current MLLMs need explicit temporal training to handle real-world video content reliably.
  • Benchmarks built from reused annotations can scale evaluation of dynamic skills without heavy manual labeling.
  • VideoChat2's progressive training recipe provides a practical path to stronger temporal performance in video models.
  • Fairness in scoring improves when evaluation stays tied to original ground-truth labels rather than LLM judgments.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Widespread use of MVBench could shift model development away from image-only pretraining toward sequence-aware architectures.
  • The same static-to-dynamic conversion idea might extend to other modalities such as audio or 3D scene understanding.
  • Longer video clips or open-ended questions could be added later to test whether the current gains hold for more complex narratives.

Load-bearing premise

Automatically turning public video annotations into multiple-choice questions accurately measures the intended temporal skills without creating annotation biases or letting models succeed via single-frame shortcuts.

What would settle it

A controlled test in which top models score nearly as high on MVBench after temporal order is randomly shuffled or timing cues are removed, showing the benchmark can be passed without genuine sequence understanding.
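
A rough sketch of such a control follows, assuming a hypothetical `predict(frames, question, options)` callable that returns an option index and items stored as plain dictionaries. Neither is MVBench's actual interface.

```python
import random


def accuracy(predict, items, transform=lambda frames: frames):
    """Fraction of items answered correctly after applying `transform` to the frames."""
    correct = sum(
        predict(transform(item["frames"]), item["question"], item["options"])
        == item["answer_index"]
        for item in items
    )
    return correct / len(items)


def shuffled(frames, seed=0):
    """Return the same frames in a random order, destroying temporal structure."""
    frames = list(frames)
    random.Random(seed).shuffle(frames)
    return frames


def temporal_sensitivity(predict, items, seed=0):
    """Accuracy on ordered clips minus accuracy on frame-shuffled clips."""
    ordered = accuracy(predict, items)
    scrambled = accuracy(predict, items, lambda frames: shuffled(frames, seed))
    return ordered - scrambled
```

A gap near zero for a strong model would suggest the questions can be answered without sequence information. A large positive gap would support the benchmark's temporal claim.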

Original abstract

With the rapid development of Multi-modal Large Language Models (MLLMs), a number of diagnostic benchmarks have recently emerged to evaluate the comprehension capabilities of these models. However, most benchmarks predominantly assess spatial understanding in the static image tasks, while overlooking temporal understanding in the dynamic video tasks. To alleviate this issue, we introduce a comprehensive Multi-modal Video understanding Benchmark, namely MVBench, which covers 20 challenging video tasks that cannot be effectively solved with a single frame. Specifically, we first introduce a novel static-to-dynamic method to define these temporal-related tasks. By transforming various static tasks into dynamic ones, we enable the systematic generation of video tasks that require a broad spectrum of temporal skills, ranging from perception to cognition. Then, guided by the task definition, we automatically convert public video annotations into multiple-choice QA to evaluate each task. On one hand, such a distinct paradigm allows us to build MVBench efficiently, without much manual intervention. On the other hand, it guarantees evaluation fairness with ground-truth video annotations, avoiding the biased scoring of LLMs. Moreover, we further develop a robust video MLLM baseline, i.e., VideoChat2, by progressive multi-modal training with diverse instruction-tuning data. The extensive results on our MVBench reveal that, the existing MLLMs are far from satisfactory in temporal understanding, while our VideoChat2 largely surpasses these leading models by over 15% on MVBench. All models and data are available at https://github.com/OpenGVLab/Ask-Anything.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces MVBench, a benchmark with 20 video tasks for assessing temporal understanding in multi-modal large language models (MLLMs). Tasks are created via a static-to-dynamic conversion method applied to public video annotations, which are then automatically transformed into multiple-choice QA pairs. The authors also propose VideoChat2, a video MLLM trained with progressive multi-modal instruction tuning, and report that existing MLLMs perform poorly on temporal tasks while VideoChat2 outperforms them by over 15% on the new benchmark.

Significance. If the tasks genuinely isolate temporal reasoning, MVBench would provide a valuable, scalable diagnostic for video MLLMs that current static-image benchmarks do not address. The automatic annotation-conversion pipeline and open release of models, data, and code at the GitHub repository are strengths that support reproducibility and community follow-up. The approach of deriving dynamic tasks from established static ones offers a systematic way to cover perception-to-cognition temporal skills.

major comments (2)
  1. [§3] §3 (Task Definition and static-to-dynamic method): The central claim that the 20 tasks 'cannot be effectively solved with a single frame' is load-bearing for interpreting MVBench as a temporal-understanding benchmark, yet the manuscript provides no single-frame baselines, static-cue ablations, or human validation of shortcut resistance. Without these controls, performance differences could reflect exploitation of frame-level appearance or annotation patterns rather than dynamics, directly affecting the interpretation of VideoChat2's >15% gain.
  2. [§5] §5 (Experiments and results): The reported superiority of VideoChat2 is shown only on MVBench; adding comparisons against the same models on established video benchmarks (e.g., those already testing temporal reasoning) would strengthen the claim that the improvement reflects genuine advances in temporal capability rather than benchmark-specific tuning.
minor comments (2)
  1. [Abstract] The abstract and §1 could preview the exact average score and per-task range for the 15% improvement to give readers an immediate sense of effect size.
  2. [Figures] Figure captions and task examples would benefit from explicit indication of which visual cues are static versus dynamic to help readers quickly grasp the conversion procedure.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and for recognizing the potential value of MVBench as a diagnostic for temporal understanding in video MLLMs, as well as the strengths in reproducibility. We address each major comment below and describe the revisions planned for the next version of the manuscript.

Point-by-point responses
  1. Referee: [§3] §3 (Task Definition and static-to-dynamic method): The central claim that the 20 tasks 'cannot be effectively solved with a single frame' is load-bearing for interpreting MVBench as a temporal-understanding benchmark, yet the manuscript provides no single-frame baselines, static-cue ablations, or human validation of shortcut resistance. Without these controls, performance differences could reflect exploitation of frame-level appearance or annotation patterns rather than dynamics, directly affecting the interpretation of VideoChat2's >15% gain.

    Authors: We agree that empirical validation is necessary to substantiate the claim that the tasks require temporal reasoning rather than static cues. The static-to-dynamic conversion is constructed so that each task explicitly depends on temporal information (e.g., ordering of events or changes across frames) that is absent from any individual frame. Nevertheless, to strengthen the manuscript, we will add single-frame baselines for all 20 tasks, which will quantify the performance drop when temporal context is removed. We will also include a brief analysis of potential annotation patterns and how the automatic multiple-choice QA generation, grounded in public video annotations, reduces the risk of exploitable shortcuts.
    Revision: yes

  2. Referee: [§5] §5 (Experiments and results): The reported superiority of VideoChat2 is shown only on MVBench; adding comparisons against the same models on established video benchmarks (e.g., those already testing temporal reasoning) would strengthen the claim that the improvement reflects genuine advances in temporal capability rather than benchmark-specific tuning.

    Authors: While the primary contribution is the introduction of MVBench to expose limitations in existing MLLMs on temporal tasks, we acknowledge that cross-benchmark evaluation would better contextualize VideoChat2's gains. In the revised manuscript we will report results for VideoChat2 and the compared models on additional established video benchmarks that emphasize temporal reasoning, thereby clarifying whether the observed improvements generalize beyond MVBench.
    Revision: yes
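
The single-frame baseline promised in the first response could be tabulated per task along the following lines. This is a sketch under stated assumptions: the `predict` callable, the `task` field, and the choice of the middle frame are hypothetical and do not come from the paper.

```python
from collections import defaultdict


def middle_frame(frames):
    """Keep only the temporally central frame as a single-frame input."""
    return [frames[len(frames) // 2]]


def per_task_accuracy(predict, items, transform=lambda frames: frames):
    """Map each task name to its accuracy under the given frame transform."""
    hits, totals = defaultdict(int), defaultdict(int)
    for item in items:
        guess = predict(transform(item["frames"]), item["question"], item["options"])
        hits[item["task"]] += int(guess == item["answer_index"])
        totals[item["task"]] += 1
    return {task: hits[task] / totals[task] for task in totals}


def single_frame_gap(predict, items):
    """Per-task accuracy on the full clip minus accuracy on one frame."""
    full = per_task_accuracy(predict, items)
    single = per_task_accuracy(predict, items, middle_frame)
    return {task: full[task] - single[task] for task in full}
```

Reporting the gap per task rather than as one average would show which of the 20 tasks, if any, remain solvable from appearance alone.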

Circularity Check

0 steps flagged

Benchmark construction relies on external annotations and explicit transformation method with no self-referential reduction

Full rationale

The paper defines MVBench tasks via a static-to-dynamic conversion of public video annotations into MCQA pairs and reports empirical model scores including a >15% gain for VideoChat2. No equation, parameter fit, or derivation reduces to its own inputs by construction; the temporal-requirement claim follows directly from the stated transformation procedure rather than a loop, and results are obtained by running models on the generated benchmark. The methodology is self-contained against external data sources and does not invoke load-bearing self-citations or uniqueness theorems that collapse the central claim.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper rests on the assumption that public video annotations contain sufficient temporal information and that the transformation preserves task validity without new fitted parameters or invented entities.

axioms (1)
  • domain assumption: Public video annotations can be reliably converted to multiple-choice QA without loss of temporal information or introduction of bias.
    Invoked in the automatic conversion step described in the abstract.

pith-pipeline@v0.9.0 · 5611 in / 1101 out tokens · 39585 ms · 2026-05-17T20:18:29.561336+00:00 · methodology

Forward citations

Cited by 17 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. FCMBench-Video: Benchmarking Document Video Intelligence

    cs.CV 2026-04 unverdicted novelty 7.0

    FCMBench-Video is a new benchmark with 1,200 videos and 11k QA instances for evaluating Video-MLLMs on document video understanding across 28 document types.

  2. AdaSpark: Adaptive Sparsity for Efficient Long-Video Understanding

    cs.CV 2026-04 unverdicted novelty 7.0

    AdaSpark delivers up to 57% FLOP reduction in Video-LLMs for long videos through adaptive cube- and token-level sparsity without apparent loss in performance on hour-scale benchmarks.

  3. Seeing the Scene Matters: Revealing Forgetting in Video Understanding Models with a Scene-Aware Long-Video Benchmark

    cs.CV 2026-03 unverdicted novelty 7.0

    SceneBench shows VLMs lose accuracy on scene-level questions in long videos due to forgetting, and Scene-RAG retrieval improves performance by 2.5%.

  4. SpaceR: Reinforcing MLLMs in Video Spatial Reasoning

    cs.CV 2025-04 unverdicted novelty 7.0

    SpaceR uses a new verifiable dataset and map-imagination-augmented RLVR to reach SOTA spatial reasoning accuracy in MLLMs, exceeding GPT-4o on VSI-Bench.

  5. MLVU: Benchmarking Multi-task Long Video Understanding

    cs.CV 2024-06 conditional novelty 7.0

    MLVU is a new benchmark for long video understanding that uses extended videos across diverse genres and multi-task evaluations, revealing that current MLLMs struggle significantly and degrade sharply with longer durations.

  6. Co-Evolving Policy Distillation

    cs.LG 2026-04 unverdicted novelty 6.0

    CoPD integrates multiple expert capabilities by running parallel RLVR training with bidirectional online policy distillation among experts, outperforming mixed RLVR and sequential OPD while surpassing domain-specific ...

  7. MACS: Modality-Aware Capacity Scaling for Efficient Multimodal MoE Inference

    cs.LG 2026-04 unverdicted novelty 6.0

    MACS improves MoE MLLM inference efficiency via entropy-weighted token loads and dynamic modality-adaptive expert capacity allocation.

  8. MACS: Modality-Aware Capacity Scaling for Efficient Multimodal MoE Inference

    cs.LG 2026-04 unverdicted novelty 6.0

    MACS improves inference speed in multimodal MoE models by entropy-weighted balancing of visual tokens and real-time modality-adaptive expert capacity allocation.

  9. QoS-QoE Translation with Large Language Model

    cs.MM 2026-04 unverdicted novelty 6.0

    A new QoS-QoE Translation dataset is constructed from multimedia literature and fine-tuned LLMs demonstrate strong performance on bidirectional continuous and discrete QoS-QoE predictions.

  10. VERTIGO: Visual Preference Optimization for Cinematic Camera Trajectory Generation

    cs.CV 2026-04 conditional novelty 6.0

    VERTIGO post-trains camera trajectory generators with visual preference signals from Unity-rendered previews scored by a cinematically fine-tuned VLM, cutting character off-screen rates from 38% to near zero while imp...

  11. HERMES: KV Cache as Hierarchical Memory for Efficient Streaming Video Understanding

    cs.CV 2026-01 unverdicted novelty 6.0

    HERMES organizes the KV cache into a hierarchical memory to enable real-time streaming video understanding in MLLMs, achieving 10x faster TTFT and up to 11.4% accuracy gains on streaming benchmarks with 68% fewer tokens.

  12. TempCompass: Do Video LLMs Really Understand Videos?

    cs.CV 2024-03 unverdicted novelty 6.0

    TempCompass benchmark reveals that state-of-the-art Video LLMs have poor ability to perceive temporal aspects such as speed, direction, and ordering in videos.

  13. VLMaxxing through FrameMogging Training-Free Anti-Recomputation for Video Vision-Language Models

    cs.CV 2026-05 unverdicted novelty 5.0

    Training-free adaptive reuse of stable visual state in video VLMs reduces follow-up latency by 15-36x on Qwen2.5-VL while preserving correctness on VideoMME, with smaller first-query speedups via pruning.

  14. InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output

    cs.CV 2024-07 conditional novelty 5.0

    InternLM-XComposer-2.5 is a 7B vision-language model supporting up to 96K context that reaches GPT-4V-level performance on image, video, and multi-turn tasks and adds LoRA-driven text-image composition capabilities.

  15. PLLaVA : Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning

    cs.CV 2024-04 conditional novelty 5.0

    A temporal pooling layer added to LLaVA smooths video feature distributions and lifts performance on dense video captioning and QA to new SOTA levels without extra parameters.

  16. EasyVideoR1: Easier RL for Video Understanding

    cs.CV 2026-04 unverdicted novelty 4.0

    EasyVideoR1 delivers an optimized RL pipeline for video understanding in large vision-language models, achieving 1.47x throughput gains and aligned results on 22 benchmarks.

  17. VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding

    cs.CV 2025-01 unverdicted novelty 4.0

    VideoLLaMA3 uses a vision-centric training paradigm and token-reduction design to reach competitive results on image and video benchmarks.

Reference graph

Works this paper leans on

104 extracted references · 104 canonical work pages · cited by 16 Pith papers · 24 internal anchors

  1. [1]

    Flamingo: a Visual Language Model for Few-Shot Learning

    Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, An- toine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katie Millican, Malcolm Reynolds, Roman Ring, Eliza Rutherford, Serkan Cabi, Tengda Han, Zhitao Gong, Sina Samangooei, Marianne Monteiro, Jacob Menick, Se- bastian Borgeaud, Andy Brock, Aida Nematzadeh, Sa- hand Sharifzadeh, Mikolaj Binkow...

  2. [2]

    Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond

    Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: A frontier large vision-language model with versatile abilities. ArXiv, abs/2308.12966, 2023. 11

  3. [3]

    Frozen in time: A joint video and image encoder for end-to-end retrieval

    Max Bain, Arsha Nagrani, G ¨ul Varol, and Andrew Zisser- man. Frozen in time: A joint video and image encoder for end-to-end retrieval. In ICCV, 2021. 6, 8

  4. [4]

    Ali Furkan Biten, Rub `en P ´erez Tito, Andr ´es Mafla, Llu ´ıs G´omez, Marc ¸al Rusi˜nol, Ernest Valveny, C. V . Jawahar, and Dimosthenis Karatzas. Scene text visual question answer- ing. In ICCV, 2019. 6

  5. [5]

    Language models are few-shot learners

    Tom Brown, Benjamin Mann, Nick Ryder, Melanie Sub- biah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakan- tan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. In NeurIPS, 2020. 2

  6. [6]

    Conceptual 12m: Pushing web-scale image-text pre-training to recognize long-tail visual concepts

    Soravit Changpinyo, Piyush Sharma, Nan Ding, and Radu Soricut. Conceptual 12m: Pushing web-scale image-text pre-training to recognize long-tail visual concepts. In CVPR, 2021. 6

  7. [7]

    Chen and William B

    David L. Chen and William B. Dolan. Collecting highly parallel data for paraphrase evaluation. In ACL, 2011. 2

  8. [8]

    MiniGPT-v2: large language model as a unified interface for vision-language multi-task learning

    Jun Chen, Deyao Zhu, Xiaoqian Shen, Xiang Li, Zechun Liu, Pengchuan Zhang, Raghuraman Krishnamoorthi, Vikas Chandra, Yunyang Xiong, and Mohamed Elho- seiny. Minigpt-v2: large language model as a unified interface for vision-language multi-task learning. ArXiv, abs/2310.09478, 2023. 11

  9. [9]

    Shikra: Unleashing Multimodal LLM's Referential Dialogue Magic

    Ke Chen, Zhao Zhang, Weili Zeng, Richong Zhang, Feng Zhu, and Rui Zhao. Shikra: Unleashing multimodal llm’s referential dialogue magic. ArXiv, abs/2306.15195, 2023. 11

  10. [10]

    Shazeer, Vinodkumar Prab- hakaran, Emily Reif, Nan Du, Benton C

    Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebas- tian Gehrmann, Parker Schuh, Kensen Shi, Sasha Tsvyashchenko, Joshua Maynez, Abhishek Rao, Parker Barnes, Yi Tay, Noam M. Shazeer, Vinodkumar Prab- hakaran, Emily Reif, Nan Du, Benton C. Hutchinson, Reiner Pope...

  11. [11]

    Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Albert Li, Pas- cale Fung, and Steven C. H. Hoi. Instructblip: Towards general-purpose vision-language models with instruction tuning. In NeurIPS, 2023. 2, 6, 7, 8

  12. [12]

    Fu, Stefano Ermon, Atri Rudra, and Christopher R´e

    Tri Dao, Daniel Y . Fu, Stefano Ermon, Atri Rudra, and Christopher R´e. FlashAttention: Fast and memory-efficient exact attention with IO-awareness. In NeurIPS, 2022. 9

  13. [13]

    Doell, and Jason J

    Pradipto Das, Chenliang Xu, Richard F. Doell, and Jason J. Corso. A thousand frames in just a few words: Lingual description of videos through latent topics and sparse object stitching. In CVPR, 2013. 6

  14. [14]

    Imagenet: A large-scale hierarchical im- age database

    Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical im- age database. In CVPR, 2009. 6

  15. [15]

    BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

    Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirec- tional transformers for language understanding. ArXiv, abs/1810.04805, 2018. 1, 2, 6

  16. [16]

    Xia, Mehdi S

    Danny Driess, F. Xia, Mehdi S. M. Sajjadi, Corey Lynch, Aakanksha Chowdhery, Brian Ichter, Ayzaan Wahid, Jonathan Tompson, Quan Ho Vuong, Tianhe Yu, Wenlong Huang, Yevgen Chebotar, Pierre Sermanet, Daniel Duck- worth, Sergey Levine, Vincent Vanhoucke, Karol Hausman, Marc Toussaint, Klaus Greff, Andy Zeng, Igor Mordatch, and Peter R. Florence. Palm-e: An e...

  17. [17]

    MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models

    Chaoyou Fu, Peixian Chen, Yunhang Shen, Yulei Qin, Mengdan Zhang, Xu Lin, Zhenyu Qiu, Wei Lin, Jinrui Yang, Xiawu Zheng, Ke Li, Xing Sun, and Rongrong Ji. Mme: A comprehensive evaluation benchmark for mul- timodal large language models. ArXiv, abs/2306.13394,

  18. [18]

    Violet: End-to- end video-language transformers with masked visual-token modeling

    Tsu-Jui Fu, Linjie Li, Zhe Gan, Kevin Lin, William Yang Wang, Lijuan Wang, and Zicheng Liu. Violet: End-to- end video-language transformers with masked visual-token modeling. ArXiv, abs/2111.12681, 2021. 10

  19. [19]

    Mist : Multi-modal iterative spatial-temporal transformer for long-form video question answering

    Difei Gao, Luowei Zhou, Lei Ji, Linchao Zhu, Yezhou Yang, and Mike Zheng Shou. Mist : Multi-modal iterative spatial-temporal transformer for long-form video question answering. In CVPR, 2022. 10

  20. [20]

    Gao, Chen Sun, Zhenheng Yang, and Ramakant Nevatia

    J. Gao, Chen Sun, Zhenheng Yang, and Ramakant Nevatia. Tall: Temporal activity localization via language query. In ICCV, 2017. 3, 12

  21. [21]

    Multimodal-gpt: A vision and lan- guage model for dialogue with humans

    Tao Gong, Chengqi Lyu, Shilong Zhang, Yudong Wang, Miao Zheng, Qianmengke Zhao, Kuikun Liu, Wenwei Zhang, Ping Luo, and Kai Chen. Multimodal-gpt: A vi- sion and language model for dialogue with humans. ArXiv, abs/2305.04790, 2023. 2

  22. [22]

    something something

    Raghav Goyal, Samira Ebrahimi Kahou, Vincent Michal- ski, Joanna Materzynska, Susanne Westphal, Heuna Kim, Valentin Haenel, Ingo Fr ¨und, Peter Yianilos, Moritz Mueller-Freitag, Florian Hoppe, Christian Thurau, Ingo Bax, and Roland Memisevic. The “something something” video database for learning and evaluating visual common sense. In ICCV, 2017. 2, 6, 9

  23. [23]

    Making the v in vqa matter: El- evating the role of image understanding in visual question answering

    Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Making the v in vqa matter: El- evating the role of image understanding in visual question answering. In CVPR, 2017. 2, 6

  24. [24]

    Ramakrishnan, Fiona Ryan, Jayant Sharma, Michael Wray, Mengmeng Xu, Eric Z

    Kristen Grauman, Andrew Westbury, Eugene Byrne, Zachary Chavis, Antonino Furnari, Rohit Girdhar, Jack- son Hamburger, Hao Jiang, Miao Liu, Xingyu Liu, Miguel Martin, Tushar Nagarajan, Ilija Radosavovic, Santhosh K. Ramakrishnan, Fiona Ryan, Jayant Sharma, Michael Wray, Mengmeng Xu, Eric Z. Xu, Chen Zhao, Siddhant Bansal, Dhruv Batra, Vincent Cartillier, S...

  25. [25]

    Edward Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen- Zhu, Yuanzhi Li, Shean Wang, and Weizhu Chen

    J. Edward Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen- Zhu, Yuanzhi Li, Shean Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models. In ICLR,

  26. [26]

    Language Is Not All You Need: Aligning Perception with Language Models

    Shaohan Huang, Li Dong, Wenhui Wang, Yaru Hao, Saksham Singhal, Shuming Ma, Tengchao Lv, Lei Cui, Owais Khan Mohammed, Qiang Liu, Kriti Aggarwal, Zewen Chi, Johan Bjorck, Vishrav Chaudhary, Subhojit Som, Xia Song, and Furu Wei. Language is not all you need: Aligning perception with language models. ArXiv, abs/2302.14045, 2023. 1

  27. [27]

    Hudson and Christopher D

    Drew A. Hudson and Christopher D. Manning. Gqa: A new dataset for real-world visual reasoning and compositional question answering. In CVPR, 2019. 6

  28. [28]

    Jang, Yale Song, Youngjae Yu, Youngjin Kim, and Gun- hee Kim

    Y . Jang, Yale Song, Youngjae Yu, Youngjin Kim, and Gun- hee Kim. Tgif-qa: Toward spatio-temporal reasoning in visual question answering. In CVPR, 2017. 6

  29. [29]

    Mistral 7B

    Albert Qiaochu Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de Las Casas, Florian Bressand, Gianna Lengyel, Guil- laume Lample, Lucile Saulnier, L’elio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timoth ´ee Lacroix, and William El Sayed. Mistral 7b. ArXiv, abs/2...

  30. [30]

    Lawrence Zitnick, and Ross B

    Justin Johnson, Bharath Hariharan, Laurens van der Maaten, Li Fei-Fei, C. Lawrence Zitnick, and Ross B. Gir- shick. Clevr: A diagnostic dataset for compositional lan- guage and elementary visual reasoning. In CVPR, 2017. 4, 6

  31. [31]

    The Kinetics Human Action Video Dataset

    Will Kay, Jo ˜ao Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, Sudheendra Vijayanarasimhan, Fabio Viola, Tim Green, Trevor Back, Apostol Natsev, Mustafa Suley- man, and Andrew Zisserman. The kinetics human action video dataset. ArXiv, abs/1705.06950, 2017. 2

  32. [32]

    Beyond the nav-graph: Vision-and- language navigation in continuous environments

    Jacob Krantz, Erik Wijmans, Arjun Majumdar, Dhruv Ba- tra, and Stefan Lee. Beyond the nav-graph: Vision-and- language navigation in continuous environments. InECCV,

  33. [33]

    A hierarchical approach for generating descriptive image paragraphs

    Jonathan Krause, Justin Johnson, Ranjay Krishna, and Li Fei-Fei. A hierarchical approach for generating descriptive image paragraphs. In CVPR, 2017. 6

  34. [34]

    Visual genome: Connecting language and vision using crowdsourced dense image annotations

    Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalan- tidis, Li-Jia Li, David A Shamma, et al. Visual genome: Connecting language and vision using crowdsourced dense image annotations. IJCV, 2017. 6

  35. [35]

    Jie Lei, Licheng Yu, Mohit Bansal, and Tamara L. Berg. Tvqa: Localized, compositional video question answering. In EMNLP, 2018. 3, 7, 10, 12

  36. [36]

    Moreno, and Jes ´us Lov´on-Melgarejo

    Paul Lerner, Olivier Ferret, Camille Guinaudeau, Herv ´e Le Borgne, Romaric Besanc ¸on, Jos´e G. Moreno, and Jes ´us Lov´on-Melgarejo. Viquae, a dataset for knowledge-based visual question answering about named entities. In SIGIR,

  37. [37]

    SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension

    Bohao Li, Rui Wang, Guangzhi Wang, Yuying Ge, Yix- iao Ge, and Ying Shan. Seed-bench: Benchmarking multimodal llms with generative comprehension. ArXiv, abs/2307.16125, 2023. 2, 9

  38. [38]

    Otter: A Multi-Modal Model with In-Context Instruction Tuning

    Bo Li, Yuanhan Zhang, Liangyu Chen, Jinghao Wang, Jingkang Yang, and Ziwei Liu. Otter: A multi-modal model with in-context instruction tuning. ArXiv, abs/2305.03726,

  39. [39]

    Junnan Li, Dongxu Li, Silvio Savarese, and Steven C. H. Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In ICML, 2022. 1, 6, 7

  40. [40]

    Inten- tqa: Context-aware video intent reasoning

    Jiapeng Li, Ping Wei, Wenjuan Han, and Lifeng Fan. Inten- tqa: Context-aware video intent reasoning. 2023. 7, 10

  41. [41]

    Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, Limin Wang, and Y . Qiao. Uniformerv2: Spatiotemporal learning by arming image vits with video uniformer.ArXiv, abs/2211.09552, 2022. 6, 10

  42. [42]

    VideoChat: Chat-Centric Video Understanding

    Kunchang Li, Yinan He, Yi Wang, Yizhuo Li, Wen Wang, Ping Luo, Yali Wang, Limin Wang, and Yu Qiao. Videochat: Chat-centric video understanding. ArXiv, abs/2305.06355, 2023. 1, 2, 5, 6, 7, 8, 9, 10, 11

  43. [43]

    Unmasked teacher: Towards training-efficient video foundation models

    Kunchang Li, Yali Wang, Yizhuo Li, Yi Wang, Yinan He, Limin Wang, and Yu Qiao. Unmasked teacher: Towards training-efficient video foundation models. In ICCV, 2023. 2, 6, 8, 9, 10, 12

  44. [44]

    M3it: A large-scale dataset towards multi-modal multilingual instruction tun- ing

    Lei Li, Yuwei Yin, Shicheng Li, Liang Chen, Peiyi Wang, Shuhuai Ren, Mukai Li, Yazheng Yang, Jingjing Xu, Xu Sun, Lingpeng Kong, and Qi Liu. M3it: A large-scale dataset towards multi-modal multilingual instruction tun- ing. ArXiv, abs/2306.04387, 2023. 5

  45. [45]

    Evaluating Object Hallucination in Large Vision-Language Models

    Yifan Li, Yifan Du, Kun Zhou, Jinpeng Wang, Wayne Xin Zhao, and Ji rong Wen. Evaluating object hallucination in large vision-language models. ArXiv, abs/2305.10355,

  46. [46]

    Microsoft coco: Common objects in context

    Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Doll ´ar, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In ECCV, 2014. 6, 8

  47. [47]

    Visual instruction tuning

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. In NeurIPS, 2023. 1, 2, 6, 7, 10

  48. [48]

    Ntu rgb+d 120: A large-scale benchmark for 3d human activity understand- ing

    Jun Liu, Amir Shahroudy, Mauricio Perez, Gang Wang, Ling-Yu Duan, and Alex C Kot. Ntu rgb+d 120: A large-scale benchmark for 3d human activity understand- ing. TPAMI, 2020. 3, 12

  49. [49]

    MMBench: Is Your Multi-modal Model an All-around Player?

    Yuanzhan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, Kai Chen, and Dahua Lin. Mm- bench: Is your multi-modal model an all-around player? ArXiv, abs/2307.06281, 2023. 1, 2, 3, 5, 8, 9

  50. [50]

    Val- ley: Video assistant with large language model enhanced ability

    Ruipu Luo, Ziwang Zhao, Min Yang, Junwei Dong, Ming- Hui Qiu, Pengcheng Lu, Tao Wang, and Zhongyu Wei. Val- ley: Video assistant with large language model enhanced ability. ArXiv, abs/2306.07207, 2023. 2

  51. [51]

    Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models

    Muhammad Maaz, Hanoona Abdul Rasheed, Salman Khan, and Fahad Shahbaz Khan. Video-chatgpt: Towards detailed video understanding via large vision and language models. ArXiv, abs/2306.05424, 2023. 2, 6, 7, 8, 10, 11

  52. [52]

    Egoschema: A diagnostic benchmark for very long-form video language understanding

    Karttikeya Mangalam, Raiymbek Akshulakov, and Jitendra Malik. Egoschema: A diagnostic benchmark for very long-form video language understanding. ArXiv, abs/2308.09126, 2023. 7, 10

  53. [53]

    Ok-vqa: A visual question answering benchmark requiring external knowledge

    Kenneth Marino, Mohammad Rastegari, Ali Farhadi, and Roozbeh Mottaghi. Ok-vqa: A visual question answering benchmark requiring external knowledge. In CVPR, 2019. 2, 6

  54. [54]

    Docvqa: A dataset for vqa on document images

    Minesh Mathew, Dimosthenis Karatzas, R. Manmatha, and C. V. Jawahar. Docvqa: A dataset for vqa on document images. In WACV, 2021. 6

  55. [55]

    Ocr-vqa: Visual question answering by reading text in images

    Anand Mishra, Shashank Shekhar, Ajeet Kumar Singh, and Anirban Chakraborty. Ocr-vqa: Visual question answering by reading text in images. In ICDAR, 2019. 6

  56. [56]

    Spoken moments: Learning joint audio-visual representations from video descriptions

    Mathew Monfort and SouYoung Jin. Spoken moments: Learning joint audio-visual representations from video descriptions. In CVPR, 2021. 7

  57. [57]

    Moments in time dataset: One million videos for event understanding

    Mathew Monfort, Bolei Zhou, Sarah Adel Bargal, Alex Andonian, Tom Yan, Kandan Ramakrishnan, Lisa M. Brown, Quanfu Fan, Dan Gutfreund, Carl Vondrick, and Aude Oliva. Moments in time dataset: One million videos for event understanding. TPAMI, 2020. 3, 12

  58. [58]

    OpenAI. Chatgpt. https://openai.com/blog/chatgpt/, 2023. 1, 4, 5, 8, 10

  59. [59]

    Gpt-4v(ision) system card

    OpenAI. Gpt-4v(ision) system card. https://api.semanticscholar.org/CorpusID:263218031,

  60. [60]

    Im2text: Describing images using 1 million captioned photographs

    Vicente Ordonez, Girish Kulkarni, and Tamara Berg. Im2text: Describing images using 1 million captioned photographs. In NeurIPS, 2011. 6

  61. [61]

    Koster, Junlin Zhang, Stephanie Winkler, Yusuf Aytar, Simon Osindero, Dima Damen, Andrew Zisserman, and João Carreira

    Viorica Patraucean, Lucas Smaira, Ankush Gupta, Adrià Recasens Continente, Larisa Markeeva, Dylan Banarse, Mateusz Malinowski, Yezhou Yang, Carl Doersch, Tatiana Matejovicova, Yury Sulsky, Antoine Miech, Skanda Koppula, Alexander Fréchette, Hanna Klimczak, R. Koster, Junlin Zhang, Stephanie Winkler, Yusuf Aytar, Simon Osindero, Dima Damen, Andr...

  62. [62]

    Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence models

    Bryan A Plummer, Liwei Wang, Chris M Cervantes, Juan C Caicedo, Julia Hockenmaier, and Svetlana Lazebnik. Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence models. In ICCV,

  63. [63]

    Exploring the limits of transfer learning with a unified text-to-text transformer

    Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. JMLR, 2020. 2

  64. [64]

    A-okvqa: A benchmark for visual question answering using world knowledge

    Dustin Schwenk, Apoorv Khandelwal, Christopher Clark, Kenneth Marino, and Roozbeh Mottaghi. A-okvqa: A benchmark for visual question answering using world knowledge. In ECCV, 2022. 6

  65. [65]

    Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning

    Piyush Sharma, Nan Ding, Sebastian Goodman, and Radu Soricut. Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In ACL,

  66. [66]

    Textcaps: a dataset for image captioning with reading comprehension

    Oleksii Sidorov, Ronghang Hu, Marcus Rohrbach, and Amanpreet Singh. Textcaps: a dataset for image captioning with reading comprehension. In ECCV, 2020. 6

  67. [67]

    Towards vqa models that can read

    Amanpreet Singh, Vivek Natarajan, Meet Shah, Yu Jiang, Xinlei Chen, Dhruv Batra, Devi Parikh, and Marcus Rohrbach. Towards vqa models that can read. In CVPR,

  68. [68]

    EVA-CLIP: Improved Training Techniques for CLIP at Scale

    Quan Sun, Yuxin Fang, Ledell Yu Wu, Xinlong Wang, and Yue Cao. Eva-clip: Improved training techniques for clip at scale. ArXiv, abs/2303.15389, 2023. 8

  69. [69]

    Visualmrc: Machine reading comprehension on document images

    Ryota Tanaka, Kyosuke Nishida, and Sen Yoshida. Visualmrc: Machine reading comprehension on document images. In AAAI, 2021. 6

  70. [70]

    Internlm: A multilingual language model with progressively enhanced capabilities

    InternLM Team. Internlm: A multilingual language model with progressively enhanced capabilities. https://github.com/InternLM/InternLM, 2023. 2

  71. [71]

    Vicuna: An open-source chatbot impressing gpt-4 with 90% chatgpt quality

    Vicuna Team. Vicuna: An open-source chatbot impressing gpt-4 with 90% chatgpt quality. https://vicuna.lmsys.org/, 2023. 1, 6, 8

  72. [72]

    LLaMA: Open and Efficient Foundation Language Models

    Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. Llama: Open and efficient foundation language models. ArXiv, abs/2302.13971, 2023. 1, 7, 8

  73. [73]

    Llama 2: Open Foundation and Fine-Tuned Chat Models

    Hugo Touvron, Louis Martin, Kevin R. Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Daniel M. Bikel, Lukas Blecher, Cristian Cantón Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goy...

  74. [74]

    All in one: Exploring unified video-language pre-training

    Alex Jinpeng Wang, Yixiao Ge, Rui Yan, Yuying Ge, Xudong Lin, Guanyu Cai, Jianping Wu, Ying Shan, Xiaohu Qie, and Mike Zheng Shou. All in one: Exploring unified video-language pre-training. In CVPR, 2023. 10

  75. [75]

    Temporal segment networks: Towards good practices for deep action recognition

    Limin Wang, Yuanjun Xiong, Zhe Wang, Yu Qiao, Dahua Lin, Xiaoou Tang, and Luc Van Gool. Temporal segment networks: Towards good practices for deep action recognition. In ECCV, 2016. 9

  76. [76]

    Videomae v2: Scaling video masked autoencoders with dual masking

    Limin Wang, Bingkun Huang, Zhiyu Zhao, Zhan Tong, Yinan He, Yi Wang, Yali Wang, and Yu Qiao. Videomae v2: Scaling video masked autoencoders with dual masking. In CVPR, 2023. 9

  77. [77]

    InternVideo: General Video Foundation Models via Generative and Discriminative Learning

    Yi Wang, Kunchang Li, Yizhuo Li, Yinan He, Bingkun Huang, Zhiyu Zhao, Hongjie Zhang, Jilan Xu, Yi Liu, Zun Wang, Sen Xing, Guo Chen, Junting Pan, Jiashuo Yu, Yali Wang, Limin Wang, and Yu Qiao. Internvideo: General video foundation models via generative and discriminative learning. ArXiv, abs/2212.03191, 2022. 10

  78. [78]

    Yi Wang, Yinan He, Yizhuo Li, Kunchang Li, Jiashuo Yu, Xin Jian Ma, Xinyuan Chen, Yaohui Wang, Ping Luo, Ziwei Liu, Yali Wang, Limin Wang, and Y. Qiao. Internvid: A large-scale video-text dataset for multimodal understanding and generation. ArXiv, 2023. 6

  79. [79]

    Paxion: Patching action knowledge in video-language foundation models

    Zhenhailong Wang, Ansel Blume, Sha Li, Genglin Liu, Jaemin Cho, Zineng Tang, Mohit Bansal, and Heng Ji. Paxion: Patching action knowledge in video-language foundation models. In NeurIPS, 2023. 3, 9, 12

  80. [80]

    Finetuned language models are zero-shot learners

    Jason Wei, Maarten Bosma, Vincent Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M. Dai, and Quoc V. Le. Finetuned language models are zero-shot learners. In ICLR, 2021. 2

Showing first 80 references.