pith. machine review for the scientific record.

arxiv: 2403.00476 · v3 · submitted 2024-03-01 · 💻 cs.CV

Recognition: 2 theorem links · Lean Theorem

TempCompass: Do Video LLMs Really Understand Videos?

Authors on Pith: no claims yet

Pith reviewed 2026-05-17 02:40 UTC · model grok-4.3

classification 💻 cs.CV
keywords video large language models · temporal perception · benchmark · video understanding · conflicting videos · action ordering · speed and direction

The pith

Video LLMs exhibit notably poor temporal perception ability across aspects like speed and direction.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces TempCompass to address gaps in existing benchmarks that fail to separate specific temporal aspects and limit task variety. It collects test data through conflicting videos that share identical static content but vary in one targeted temporal feature, such as speed or ordering, to block reliance on single frames or language shortcuts. Instructions are generated via human meta-annotation followed by LLM expansion, and responses are scored automatically by another LLM. When applied to eight leading Video LLMs and three Image LLMs, the benchmark shows consistent weakness in distinguishing temporal details. A sympathetic reader cares because reliable video understanding requires grasping when things happen, not just what is present.

Core claim

By constructing conflicting videos that share the same static content but differ in a specific temporal aspect and by using a range of task formats, TempCompass shows that state-of-the-art Video LLMs display notably poor temporal perception ability.

What carries the argument

Conflicting videos that share the same static content but differ in one targeted temporal aspect, which blocks single-frame bias and language priors while isolating perception of speed, direction, or ordering.
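
To make the conflicting-video device concrete, here is a minimal sketch of how such pairs could be produced from a single source clip by manipulating frame order alone; the OpenCV-based frame I/O and the specific file names are illustrative assumptions, not the authors' collection pipeline.

```python
# Minimal sketch: derive "conflicting" variants of one clip that share static
# content but differ in exactly one temporal aspect. Illustrative only, not
# the TempCompass collection pipeline.
import cv2


def read_frames(path):
    """Load all frames of a video as a list of BGR arrays."""
    cap = cv2.VideoCapture(path)
    frames = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        frames.append(frame)
    cap.release()
    return frames


def write_frames(frames, path, fps=25.0):
    """Write frames out at a fixed frame rate."""
    h, w = frames[0].shape[:2]
    out = cv2.VideoWriter(path, cv2.VideoWriter_fourcc(*"mp4v"), fps, (w, h))
    for frame in frames:
        out.write(frame)
    out.release()


def make_conflicting_variants(src_path):
    frames = read_frames(src_path)
    # Direction: reverse playback; every frame is unchanged, only order flips.
    write_frames(frames[::-1], "direction_conflict.mp4")
    # Speed: keep every other frame so the same action appears roughly 2x faster.
    write_frames(frames[::2], "speed_conflict.mp4")
    # Event order: swap the two halves of the clip.
    mid = len(frames) // 2
    write_frames(frames[mid:] + frames[:mid], "order_conflict.mp4")


if __name__ == "__main__":
    make_conflicting_variants("original.mp4")  # hypothetical source clip
```

Because every variant is built from the same frames, a model that answers correctly has to read the temporal structure rather than any single frame or linguistic prior.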

If this is right

  • Models that cannot separate temporal aspects will give unreliable answers on tasks that depend on timing or order.
  • Performance gaps appear across multiple formats, indicating the weakness is not limited to question-answering.
  • Image LLMs show similar limitations, suggesting the issue stems from lack of temporal modeling rather than video-specific training alone.
  • Future model development must target explicit handling of temporal differences to improve video comprehension.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The results suggest current Video LLMs function more like language models with occasional image access than true video reasoners.
  • Benchmarks that do not use controlled conflicts may overestimate capabilities by allowing models to guess from static or linguistic cues.
  • Training procedures could be strengthened by including pairs of videos that differ only in timing to force temporal sensitivity.

Load-bearing premise

The conflicting videos isolate the intended temporal aspect without introducing other exploitable cues, and the LLM-based automatic evaluation correctly measures true model performance.

What would settle it

A Video LLM that distinguishes the differing temporal feature in conflicting videos at accuracy well above random guessing on most task types would undermine the finding of poor temporal perception.
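
One way to operationalize "well above random guessing" is a simple binomial test of a model's accuracy against the chance level of a task format; the counts below are placeholders rather than numbers from the paper.

```python
# Sketch: is an observed accuracy above chance on a forced-choice task?
# Counts are placeholders, not results from TempCompass.
from scipy.stats import binomtest

n_questions = 400   # hypothetical number of two-option questions
n_correct = 232     # hypothetical number answered correctly
chance = 0.5        # guessing rate for a two-option format (0.25 for 4-way MCQ)

result = binomtest(n_correct, n_questions, chance, alternative="greater")
print(f"accuracy = {n_correct / n_questions:.3f}, p-value = {result.pvalue:.4g}")
# Clearly above-chance accuracy across most task formats would undermine the
# finding of notably poor temporal perception.
```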

read the original abstract

Recently, there is a surge in interest surrounding video large language models (Video LLMs). However, existing benchmarks fail to provide a comprehensive feedback on the temporal perception ability of Video LLMs. On the one hand, most of them are unable to distinguish between different temporal aspects (e.g., speed, direction) and thus cannot reflect the nuanced performance on these specific aspects. On the other hand, they are limited in the diversity of task formats (e.g., only multi-choice QA), which hinders the understanding of how temporal perception performance may vary across different types of tasks. Motivated by these two problems, we propose the TempCompass benchmark, which introduces a diversity of temporal aspects and task formats. To collect high-quality test data, we devise two novel strategies: (1) In video collection, we construct conflicting videos that share the same static content but differ in a specific temporal aspect, which prevents Video LLMs from leveraging single-frame bias or language priors. (2) To collect the task instructions, we propose a paradigm where humans first annotate meta-information for a video and then an LLM generates the instruction. We also design an LLM-based approach to automatically and accurately evaluate the responses from Video LLMs. Based on TempCompass, we comprehensively evaluate 8 state-of-the-art (SOTA) Video LLMs and 3 Image LLMs, and reveal the discerning fact that these models exhibit notably poor temporal perception ability. Our data will be available at https://github.com/llyx97/TempCompass.
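
To make the human-meta-annotation-then-LLM paradigm from the abstract concrete, here is a minimal sketch of how a meta record might be expanded into task instructions; the field names, prompt wording, and task formats listed are illustrative guesses, and no real LLM client is invoked.

```python
# Sketch of the meta-annotation -> LLM instruction-generation paradigm.
# Field names and prompt wording are illustrative, not the authors' templates.
meta = {
    "static_content": "a person stacking three cups on a table",
    "temporal_aspect": "order",
    "ground_truth": "red cup placed first, then blue, then green",
}

prompt = (
    "You write benchmark instructions about a short video.\n"
    f"Static content: {meta['static_content']}\n"
    f"Temporal aspect to probe: {meta['temporal_aspect']}\n"
    f"Ground truth: {meta['ground_truth']}\n"
    "Produce one multi-choice question, one yes/no question, and one\n"
    "caption-matching item, each answerable only from the temporal aspect."
)
print(prompt)  # this prompt would then be sent to an instruction-generating LLM
```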

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance; this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper introduces TempCompass, a benchmark for assessing temporal perception in Video LLMs. It addresses limitations in prior benchmarks by covering multiple temporal aspects (speed, direction, etc.) and diverse task formats beyond multiple-choice QA. Key innovations include constructing conflicting video pairs that share static content but differ in one temporal dimension to reduce single-frame bias and language priors, a paradigm in which humans first annotate meta-information and an LLM then generates task instructions, and an LLM-based automatic evaluation method. The authors evaluate 8 SOTA Video LLMs and 3 Image LLMs, concluding that these models exhibit notably poor temporal perception ability, with data to be released publicly.

Significance. If the benchmark construction holds, this provides a valuable diagnostic tool for Video LLM development by isolating specific temporal dimensions and task types, highlighting a clear gap in current models' video understanding. Strengths include the public dataset release, the hybrid human-LLM data collection approach, and the focus on nuanced temporal aspects rather than aggregate scores.

major comments (1)
  1. [§3.1] The construction of conflicting videos is presented as sharing identical static content while differing only in a targeted temporal aspect (e.g., speed or direction), yet the section provides no quantitative verification such as pixel-wise static similarity metrics, optical-flow difference statistics, compression artifact checks, or aggregated human validation scores confirming the absence of residual non-temporal cues. This directly bears on the central claim of poor temporal perception, because any exploitable static, audio, or editing differences would allow models to succeed without genuine temporal reasoning.
minor comments (2)
  1. [Abstract] While the high-level evaluation outcome is stated, no specific quantitative results (e.g., accuracy percentages per temporal aspect or task format) are included, which would strengthen the reader's ability to gauge the scale of the reported deficiencies.
  2. The LLM-based automatic evaluation is described at a high level; adding details on prompt templates, agreement rates with human judgments, or error analysis would improve reproducibility.
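
On the second minor comment, the requested agreement analysis could be as small as the sketch below, comparing LLM-judge verdicts with human verdicts on a shared validation subset; the labels and the choice of Cohen's kappa are illustrative assumptions.

```python
# Sketch: agreement between LLM-judge verdicts and human verdicts on a shared
# validation subset. Labels are illustrative, not TempCompass data.
from sklearn.metrics import cohen_kappa_score

human_verdicts = ["correct", "wrong", "correct", "correct", "wrong", "wrong"]
llm_verdicts = ["correct", "wrong", "correct", "wrong", "wrong", "wrong"]

raw_agreement = sum(h == m for h, m in zip(human_verdicts, llm_verdicts)) / len(human_verdicts)
kappa = cohen_kappa_score(human_verdicts, llm_verdicts)
print(f"raw agreement = {raw_agreement:.2f}, Cohen's kappa = {kappa:.2f}")
# Reporting numbers like these alongside the benchmark scores would let
# readers judge how much to trust the LLM-based automatic evaluation.
```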

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address the major comment below and describe the changes we will make to strengthen the paper.

read point-by-point responses
  1. Referee: [§3.1] The construction of conflicting videos is presented as sharing identical static content while differing only in a targeted temporal aspect (e.g., speed or direction), yet the section provides no quantitative verification such as pixel-wise static similarity metrics, optical-flow difference statistics, compression artifact checks, or aggregated human validation scores confirming the absence of residual non-temporal cues. This directly bears on the central claim of poor temporal perception, because any exploitable static, audio, or editing differences would allow models to succeed without genuine temporal reasoning.

    Authors: We agree that additional quantitative verification would further support the claim that conflicting video pairs differ only in the targeted temporal dimension. In the original data collection, we performed manual editing and human review to minimize static differences, but the manuscript does not report explicit metrics. In the revised version, we will add pixel-wise static similarity metrics (e.g., average SSIM and MSE across corresponding frames), optical-flow difference statistics between pairs, and aggregated human validation scores confirming the absence of residual non-temporal cues. We will also note any audio or compression considerations for the selected videos. revision: yes
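
A minimal version of the verification the rebuttal promises could look like the sketch below, using the mean frame as a static-content proxy and mean optical-flow magnitude as a motion proxy for each clip in a pair; the library choices (scikit-image SSIM, OpenCV Farnebäck flow) and file names are assumptions, not the authors' tooling.

```python
# Sketch: quantify static similarity and motion difference for a conflicting
# pair. Library choices are illustrative, not the authors' pipeline.
import cv2
import numpy as np
from skimage.metrics import structural_similarity


def clip_stats(path):
    """Return (mean grayscale frame, mean optical-flow magnitude) for a clip."""
    cap = cv2.VideoCapture(path)
    frames, magnitudes = [], []
    prev = None
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        frames.append(gray.astype(float))
        if prev is not None:
            flow = cv2.calcOpticalFlowFarneback(prev, gray, None,
                                                0.5, 3, 15, 3, 5, 1.2, 0)
            magnitudes.append(np.linalg.norm(flow, axis=-1).mean())
        prev = gray
    cap.release()
    return np.mean(frames, axis=0), float(np.mean(magnitudes))


mean_a, motion_a = clip_stats("original.mp4")        # hypothetical pair
mean_b, motion_b = clip_stats("speed_conflict.mp4")

# Static check: mean frames of a true conflicting pair should be nearly
# identical (SSIM close to 1, MSE close to 0).
ssim = structural_similarity(mean_a, mean_b, data_range=255)
mse = float(np.mean((mean_a - mean_b) ** 2))
print(f"static SSIM = {ssim:.3f}, static MSE = {mse:.1f}")
# Temporal check: motion statistics should differ for speed pairs while the
# static statistics stay matched.
print(f"mean flow magnitude: A = {motion_a:.3f}, B = {motion_b:.3f}")
```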

Circularity Check

0 steps flagged

No circularity: empirical benchmark rests on new data construction, not derivations or self-referential reductions

full rationale

This is an empirical benchmark paper introducing TempCompass with novel video collection strategies (conflicting videos sharing static content) and LLM-assisted instruction generation plus evaluation. No mathematical derivations, fitted parameters, or equations appear in the provided text. The central claim of poor temporal perception rests on newly collected test data rather than reducing to prior fitted quantities, self-citations, or ansatzes by construction. No load-bearing steps match the enumerated circularity patterns; the work is self-contained rather than resting on external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim depends on domain assumptions about video construction and evaluation rather than new mathematical axioms or fitted parameters.

axioms (2)
  • domain assumption Conflicting videos that share static content but differ in one temporal aspect prevent models from using single-frame bias or language priors.
    Invoked in the video collection strategy described in the abstract.
  • domain assumption LLM-based automatic evaluation of model responses is accurate and reliable for this benchmark.
    Stated as the method for scoring Video LLM outputs.

pith-pipeline@v0.9.0 · 5599 in / 1219 out tokens · 26370 ms · 2026-05-17T02:40:35.903339+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 22 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. EvoGround: Self-Evolving Video Agents for Video Temporal Grounding

    cs.CV 2026-05 unverdicted novelty 7.0

    A proposer-solver agent pair achieves supervised-level video temporal grounding and fine-grained captioning from 2.5K unlabeled videos via self-reinforcing evolution.

  2. Motion-o: Trajectory-Grounded Video Reasoning

    cs.CV 2026-03 conditional novelty 7.0

    Motion-o extends VLMs with Motion Chain of Thought (MCoT) using <motion/> tags and perturbation rewards to make object trajectories explicit and supervised in video reasoning.

  3. VideoASMR-Bench: Can AI-Generated ASMR Videos Fool VLMs and Humans?

    cs.CV 2025-12 unverdicted novelty 7.0

    VideoASMR-Bench shows state-of-the-art VLMs fail to reliably detect AI-generated ASMR videos from real ones, though humans can still identify the fakes relatively easily.

  4. See, Hear, and Understand: Benchmarking Audiovisual Human Speech Understanding in Multimodal Large Language Models

    cs.CV 2025-12 unverdicted novelty 7.0

    AV-SpeakerBench is a new speaker-centered benchmark showing that top multimodal models still struggle with fine-grained audiovisual speech understanding, with Gemini 2.5 Pro leading but open models lagging on fusion.

  5. Video-Holmes: Can MLLM Think Like Holmes for Complex Video Reasoning?

    cs.CV 2025-05 conditional novelty 7.0

    Video-Holmes benchmark shows top MLLMs achieve at most 45% accuracy on tasks needing integration of multiple clues from suspense films, unlike existing perception-focused tests.

  6. Video-R1: Reinforcing Video Reasoning in MLLMs

    cs.CV 2025-03 conditional novelty 7.0

    Video-R1 uses temporal-aware RL and mixed datasets to boost video reasoning in MLLMs, with a 7B model reaching 37.1% on VSI-Bench and surpassing GPT-4o.

  7. WorldSense: Evaluating Real-world Omnimodal Understanding for Multimodal LLMs

    cs.CV 2025-02 unverdicted novelty 7.0

    WorldSense provides the first benchmark requiring synergistic audio-video-text understanding on 1,662 real-world videos and 3,172 QA pairs, where the best current multimodal LLM reaches only 65.1% accuracy.

  8. Video-MMMU: Evaluating Knowledge Acquisition from Multi-Discipline Professional Videos

    cs.CV 2025-01 unverdicted novelty 7.0

    Video-MMMU benchmark shows large multimodal models exhibit steep performance drops on higher cognitive tasks when learning from professional videos and lag significantly behind humans in knowledge acquisition.

  9. POINTS-Long: Adaptive Dual-Mode Visual Reasoning in MLLMs

    cs.CV 2026-04 unverdicted novelty 6.0

    POINTS-Long is a dual-mode multimodal large language model that uses dynamic visual token scaling to retain 97.7-99.7% accuracy on long-form tasks with 1/40 to 1/10 of the tokens and supports streaming via a detachable KV-cache.

  10. From Perception to Planning: Evolving Ego-Centric Task-Oriented Spatiotemporal Reasoning via Curriculum Learning

    cs.AI 2026-04 unverdicted novelty 6.0

    EgoTSR applies a three-stage curriculum on a 46-million-sample dataset to build egocentric spatiotemporal reasoning, reaching 92.4% accuracy on long-horizon tasks and reducing chronological biases.

  11. Reinforce to Learn, Elect to Reason: A Dual Paradigm for Video Reasoning

    cs.CV 2026-04 unverdicted novelty 6.0

    RLER trains video-reasoning models with three task-driven RL rewards for evidence production and elects the best answer from a few candidates via evidence consistency scoring, yielding 6.3% average gains on eight benchmarks.

  12. Graph-to-Frame RAG: Visual-Space Knowledge Fusion for Training-Free and Auditable Video Reasoning

    cs.CV 2026-04 unverdicted novelty 6.0

    G2F-RAG converts retrieved knowledge subgraphs into a single visual reasoning frame appended to videos, enabling training-free and interpretable improvements for LMM-based video reasoning on knowledge-intensive tasks.

  13. STRIVE: Structured Spatiotemporal Exploration for Reinforcement Learning in Video Question Answering

    cs.CV 2026-04 unverdicted novelty 6.0

    STRIVE stabilizes RL for video QA by creating spatiotemporal video variants and using importance-aware sampling, yielding consistent gains over baselines on six benchmarks.

  14. Streaming Video Instruction Tuning

    cs.CV 2025-12 unverdicted novelty 6.0

    Streamo is a streaming video LLM trained end-to-end on the new Streamo-Instruct-465K dataset that unifies multiple real-time video tasks with claimed strong temporal reasoning and generalization.

  15. Detector-Empowered Video Large Language Model for Efficient Spatio-Temporal Grounding

    cs.CV 2025-12 conditional novelty 6.0

    DEViL offloads spatial grounding to a detector via a distilled reference-semantic token and temporal consistency regularization, reaching 43.1% m_vIoU at 14.33 FPS on HC-STVG.

  16. V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning

    cs.AI 2025-06 unverdicted novelty 6.0

    V-JEPA 2 pre-trained on massive unlabeled video achieves strong results on motion understanding and action anticipation, SOTA video QA at 8B scale, and enables zero-shot robotic planning on Franka arms using only 62 h...

  17. LLaVA-Video: Video Instruction Tuning With Synthetic Data

    cs.CV 2024-10 unverdicted novelty 6.0

    LLaVA-Video-178K is a new synthetic video instruction dataset that, when combined with existing data to train LLaVA-Video, produces strong results on video understanding benchmarks.

  18. TempR1: Improving Temporal Understanding of MLLMs via Temporal-Aware Multi-Task Reinforcement Learning

    cs.CV 2025-12 unverdicted novelty 5.0

    TempR1 applies temporal-aware multi-task RL using GRPO and three types of localization rewards to achieve SOTA temporal understanding in MLLMs with synergistic gains from joint optimization.

  19. Time-R1: Post-Training Large Vision Language Model for Temporal Video Grounding

    cs.CV 2025-03 unverdicted novelty 5.0

    Time-R1 applies RL with verifiable rewards to post-train LVLMs for temporal video grounding, reaching state-of-the-art results on multiple datasets using only 2.5K samples while also improving general video capabilities.

  20. EasyVideoR1: Easier RL for Video Understanding

    cs.CV 2026-04 unverdicted novelty 4.0

    EasyVideoR1 delivers an optimized RL pipeline for video understanding in large vision-language models, achieving 1.47x throughput gains and aligned results on 22 benchmarks.

  21. Seed1.5-VL Technical Report

    cs.CV 2025-05 unverdicted novelty 4.0

    Seed1.5-VL is a compact multimodal model that sets new records on dozens of vision-language benchmarks and outperforms prior systems on agent-style tasks.

  22. VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding

    cs.CV 2025-01 unverdicted novelty 4.0

    VideoLLaMA3 uses a vision-centric training paradigm and token-reduction design to reach competitive results on image and video benchmarks.

Reference graph

Works this paper leans on

135 extracted references · 135 canonical work pages · cited by 22 Pith papers · 26 internal anchors
