pith. machine review for the scientific record.

arxiv: 2403.00476 · v3 · submitted 2024-03-01 · 💻 cs.CV

Recognition: 2 theorem links · Lean Theorem

TempCompass: Do Video LLMs Really Understand Videos?

Authors on Pith: no claims yet

Pith reviewed 2026-05-17 02:40 UTC · model grok-4.3

classification 💻 cs.CV
keywords video large language models · temporal perception · benchmark · video understanding · conflicting videos · action ordering · speed and direction

The pith

Video LLMs exhibit notably poor temporal perception ability across aspects like speed and direction.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces TempCompass to address gaps in existing benchmarks that fail to separate specific temporal aspects and limit task variety. It collects test data through conflicting videos that share identical static content but vary in one targeted temporal feature, such as speed or ordering, to block reliance on single frames or language shortcuts. Instructions are generated via human meta-annotation followed by LLM expansion, and responses are scored automatically by another LLM. When applied to eight leading Video LLMs and three Image LLMs, the benchmark shows consistent weakness in distinguishing temporal details. A sympathetic reader cares because reliable video understanding requires grasping when things happen, not just what is present.

Core claim

By constructing conflicting videos that share the same static content but differ in a specific temporal aspect and by using a range of task formats, TempCompass shows that state-of-the-art Video LLMs display notably poor temporal perception ability.

What carries the argument

Conflicting videos that share the same static content but differ in one targeted temporal aspect, which blocks single-frame bias and language priors while isolating perception of speed, direction, or ordering.
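
To make the conflicting-video device concrete, here is a minimal sketch of how such pairs could be produced from a single source clip by manipulating frame order alone; the OpenCV-based frame I/O and the specific file names are illustrative assumptions, not the authors' collection pipeline.

```python
# Minimal sketch: derive "conflicting" variants of one clip that share static
# content but differ in exactly one temporal aspect. Illustrative only, not
# the TempCompass collection pipeline.
import cv2


def read_frames(path):
    """Load all frames of a video as a list of BGR arrays."""
    cap = cv2.VideoCapture(path)
    frames = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        frames.append(frame)
    cap.release()
    return frames


def write_frames(frames, path, fps=25.0):
    """Write frames out at a fixed frame rate."""
    h, w = frames[0].shape[:2]
    out = cv2.VideoWriter(path, cv2.VideoWriter_fourcc(*"mp4v"), fps, (w, h))
    for frame in frames:
        out.write(frame)
    out.release()


def make_conflicting_variants(src_path):
    frames = read_frames(src_path)
    # Direction: reverse playback; every frame is unchanged, only order flips.
    write_frames(frames[::-1], "direction_conflict.mp4")
    # Speed: keep every other frame so the same action appears roughly 2x faster.
    write_frames(frames[::2], "speed_conflict.mp4")
    # Event order: swap the two halves of the clip.
    mid = len(frames) // 2
    write_frames(frames[mid:] + frames[:mid], "order_conflict.mp4")


if __name__ == "__main__":
    make_conflicting_variants("original.mp4")  # hypothetical source clip
```

Because every variant is built from the same frames, a model that answers correctly has to read the temporal structure rather than any single frame or linguistic prior.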

If this is right

  • Models that cannot separate temporal aspects will give unreliable answers on tasks that depend on timing or order.
  • Performance gaps appear across multiple formats, indicating the weakness is not limited to question-answering.
  • Image LLMs show similar limitations, suggesting the issue stems from lack of temporal modeling rather than video-specific training alone.
  • Future model development must target explicit handling of temporal differences to improve video comprehension.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The results suggest current Video LLMs function more like language models with occasional image access than true video reasoners.
  • Benchmarks that do not use controlled conflicts may overestimate capabilities by allowing models to guess from static or linguistic cues.
  • Training procedures could be strengthened by including pairs of videos that differ only in timing to force temporal sensitivity.

Load-bearing premise

The conflicting videos isolate the intended temporal aspect without introducing other exploitable cues, and the LLM-based automatic evaluation correctly measures true model performance.

What would settle it

A Video LLM that distinguishes the differing temporal feature in conflicting videos at accuracy well above random guessing on most task types would undermine the finding of poor temporal perception.
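
One way to operationalize "well above random guessing" is a simple binomial test of a model's accuracy against the chance level of a task format; the counts below are placeholders rather than numbers from the paper.

```python
# Sketch: is an observed accuracy above chance on a forced-choice task?
# Counts are placeholders, not results from TempCompass.
from scipy.stats import binomtest

n_questions = 400   # hypothetical number of two-option questions
n_correct = 232     # hypothetical number answered correctly
chance = 0.5        # guessing rate for a two-option format (0.25 for 4-way MCQ)

result = binomtest(n_correct, n_questions, chance, alternative="greater")
print(f"accuracy = {n_correct / n_questions:.3f}, p-value = {result.pvalue:.4g}")
# Clearly above-chance accuracy across most task formats would undermine the
# finding of notably poor temporal perception.
```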

read the original abstract

Recently, there is a surge in interest surrounding video large language models (Video LLMs). However, existing benchmarks fail to provide a comprehensive feedback on the temporal perception ability of Video LLMs. On the one hand, most of them are unable to distinguish between different temporal aspects (e.g., speed, direction) and thus cannot reflect the nuanced performance on these specific aspects. On the other hand, they are limited in the diversity of task formats (e.g., only multi-choice QA), which hinders the understanding of how temporal perception performance may vary across different types of tasks. Motivated by these two problems, we propose the TempCompass benchmark, which introduces a diversity of temporal aspects and task formats. To collect high-quality test data, we devise two novel strategies: (1) In video collection, we construct conflicting videos that share the same static content but differ in a specific temporal aspect, which prevents Video LLMs from leveraging single-frame bias or language priors. (2) To collect the task instructions, we propose a paradigm where humans first annotate meta-information for a video and then an LLM generates the instruction. We also design an LLM-based approach to automatically and accurately evaluate the responses from Video LLMs. Based on TempCompass, we comprehensively evaluate 8 state-of-the-art (SOTA) Video LLMs and 3 Image LLMs, and reveal the discerning fact that these models exhibit notably poor temporal perception ability. Our data will be available at https://github.com/llyx97/TempCompass.
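
To make the human-meta-annotation-then-LLM paradigm from the abstract concrete, here is a minimal sketch of how a meta record might be expanded into task instructions; the field names, prompt wording, and task formats listed are illustrative guesses, and no real LLM client is invoked.

```python
# Sketch of the meta-annotation -> LLM instruction-generation paradigm.
# Field names and prompt wording are illustrative, not the authors' templates.
meta = {
    "static_content": "a person stacking three cups on a table",
    "temporal_aspect": "order",
    "ground_truth": "red cup placed first, then blue, then green",
}

prompt = (
    "You write benchmark instructions about a short video.\n"
    f"Static content: {meta['static_content']}\n"
    f"Temporal aspect to probe: {meta['temporal_aspect']}\n"
    f"Ground truth: {meta['ground_truth']}\n"
    "Produce one multi-choice question, one yes/no question, and one\n"
    "caption-matching item, each answerable only from the temporal aspect."
)
print(prompt)  # this prompt would then be sent to an instruction-generating LLM
```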

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance; this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper introduces TempCompass, a benchmark for assessing temporal perception in Video LLMs. It addresses limitations in prior benchmarks by covering multiple temporal aspects (speed, direction, etc.) and diverse task formats beyond multiple-choice QA. Key innovations include constructing conflicting video pairs that share static content but differ in one temporal dimension to reduce single-frame bias and language priors, a paradigm in which humans first annotate meta-information and an LLM then generates task instructions, and an LLM-based automatic evaluation method. The authors evaluate 8 SOTA Video LLMs and 3 Image LLMs, concluding that these models exhibit notably poor temporal perception ability, with data to be released publicly.

Significance. If the benchmark construction holds, this provides a valuable diagnostic tool for Video LLM development by isolating specific temporal dimensions and task types, highlighting a clear gap in current models' video understanding. Strengths include the public dataset release, the hybrid human-LLM data collection approach, and the focus on nuanced temporal aspects rather than aggregate scores.

major comments (1)
  1. [§3.1] The construction of conflicting videos is presented as sharing identical static content while differing only in a targeted temporal aspect (e.g., speed or direction), yet the section provides no quantitative verification such as pixel-wise static similarity metrics, optical-flow difference statistics, compression artifact checks, or aggregated human validation scores confirming the absence of residual non-temporal cues. This directly bears on the central claim of poor temporal perception, because any exploitable static, audio, or editing differences would allow models to succeed without genuine temporal reasoning.
minor comments (2)
  1. [Abstract] While the high-level evaluation outcome is stated, no specific quantitative results (e.g., accuracy percentages per temporal aspect or task format) are included, which would strengthen the reader's ability to gauge the scale of the reported deficiencies.
  2. The LLM-based automatic evaluation is described at a high level; adding details on prompt templates, agreement rates with human judgments, or error analysis would improve reproducibility.
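
On the second minor comment, the requested agreement analysis could be as small as the sketch below, comparing LLM-judge verdicts with human verdicts on a shared validation subset; the labels and the choice of Cohen's kappa are illustrative assumptions.

```python
# Sketch: agreement between LLM-judge verdicts and human verdicts on a shared
# validation subset. Labels are illustrative, not TempCompass data.
from sklearn.metrics import cohen_kappa_score

human_verdicts = ["correct", "wrong", "correct", "correct", "wrong", "wrong"]
llm_verdicts = ["correct", "wrong", "correct", "wrong", "wrong", "wrong"]

raw_agreement = sum(h == m for h, m in zip(human_verdicts, llm_verdicts)) / len(human_verdicts)
kappa = cohen_kappa_score(human_verdicts, llm_verdicts)
print(f"raw agreement = {raw_agreement:.2f}, Cohen's kappa = {kappa:.2f}")
# Reporting numbers like these alongside the benchmark scores would let
# readers judge how much to trust the LLM-based automatic evaluation.
```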

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address the major comment below and describe the changes we will make to strengthen the paper.

read point-by-point responses
  1. Referee: [§3.1] The construction of conflicting videos is presented as sharing identical static content while differing only in a targeted temporal aspect (e.g., speed or direction), yet the section provides no quantitative verification such as pixel-wise static similarity metrics, optical-flow difference statistics, compression artifact checks, or aggregated human validation scores confirming the absence of residual non-temporal cues. This directly bears on the central claim of poor temporal perception, because any exploitable static, audio, or editing differences would allow models to succeed without genuine temporal reasoning.

    Authors: We agree that additional quantitative verification would further support the claim that conflicting video pairs differ only in the targeted temporal dimension. In the original data collection, we performed manual editing and human review to minimize static differences, but the manuscript does not report explicit metrics. In the revised version, we will add pixel-wise static similarity metrics (e.g., average SSIM and MSE across corresponding frames), optical-flow difference statistics between pairs, and aggregated human validation scores confirming the absence of residual non-temporal cues. We will also note any audio or compression considerations for the selected videos. revision: yes
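
A minimal version of the verification the rebuttal promises could look like the sketch below, using the mean frame as a static-content proxy and mean optical-flow magnitude as a motion proxy for each clip in a pair; the library choices (scikit-image SSIM, OpenCV Farnebäck flow) and file names are assumptions, not the authors' tooling.

```python
# Sketch: quantify static similarity and motion difference for a conflicting
# pair. Library choices are illustrative, not the authors' pipeline.
import cv2
import numpy as np
from skimage.metrics import structural_similarity


def clip_stats(path):
    """Return (mean grayscale frame, mean optical-flow magnitude) for a clip."""
    cap = cv2.VideoCapture(path)
    frames, magnitudes = [], []
    prev = None
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        frames.append(gray.astype(float))
        if prev is not None:
            flow = cv2.calcOpticalFlowFarneback(prev, gray, None,
                                                0.5, 3, 15, 3, 5, 1.2, 0)
            magnitudes.append(np.linalg.norm(flow, axis=-1).mean())
        prev = gray
    cap.release()
    return np.mean(frames, axis=0), float(np.mean(magnitudes))


mean_a, motion_a = clip_stats("original.mp4")        # hypothetical pair
mean_b, motion_b = clip_stats("speed_conflict.mp4")

# Static check: mean frames of a true conflicting pair should be nearly
# identical (SSIM close to 1, MSE close to 0).
ssim = structural_similarity(mean_a, mean_b, data_range=255)
mse = float(np.mean((mean_a - mean_b) ** 2))
print(f"static SSIM = {ssim:.3f}, static MSE = {mse:.1f}")
# Temporal check: motion statistics should differ for speed pairs while the
# static statistics stay matched.
print(f"mean flow magnitude: A = {motion_a:.3f}, B = {motion_b:.3f}")
```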

Circularity Check

0 steps flagged

No circularity: empirical benchmark rests on new data construction, not derivations or self-referential reductions

full rationale

This is an empirical benchmark paper introducing TempCompass with novel video collection strategies (conflicting videos sharing static content) and LLM-assisted instruction generation plus evaluation. No mathematical derivations, fitted parameters, or equations appear in the provided text. The central claim of poor temporal perception rests on newly collected test data rather than reducing to prior fitted quantities, self-citations, or ansatzes by construction. No load-bearing steps match the enumerated circularity patterns; the work is self-contained rather than resting on external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim depends on domain assumptions about video construction and evaluation rather than new mathematical axioms or fitted parameters.

axioms (2)
  • domain assumption Conflicting videos that share static content but differ in one temporal aspect prevent models from using single-frame bias or language priors.
    Invoked in the video collection strategy described in the abstract.
  • domain assumption LLM-based automatic evaluation of model responses is accurate and reliable for this benchmark.
    Stated as the method for scoring Video LLM outputs.

pith-pipeline@v0.9.0 · 5599 in / 1219 out tokens · 26370 ms · 2026-05-17T02:40:35.903339+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 22 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. EvoGround: Self-Evolving Video Agents for Video Temporal Grounding

    cs.CV 2026-05 unverdicted novelty 7.0

    A proposer-solver agent pair achieves supervised-level video temporal grounding and fine-grained captioning from 2.5K unlabeled videos via self-reinforcing evolution.

  2. Motion-o: Trajectory-Grounded Video Reasoning

    cs.CV 2026-03 conditional novelty 7.0

    Motion-o extends VLMs with Motion Chain of Thought (MCoT) using <motion/> tags and perturbation rewards to make object trajectories explicit and supervised in video reasoning.

  3. VideoASMR-Bench: Can AI-Generated ASMR Videos Fool VLMs and Humans?

    cs.CV 2025-12 unverdicted novelty 7.0

    VideoASMR-Bench shows state-of-the-art VLMs fail to reliably detect AI-generated ASMR videos from real ones, though humans can still identify the fakes relatively easily.

  4. See, Hear, and Understand: Benchmarking Audiovisual Human Speech Understanding in Multimodal Large Language Models

    cs.CV 2025-12 unverdicted novelty 7.0

    AV-SpeakerBench is a new speaker-centered benchmark showing that top multimodal models still struggle with fine-grained audiovisual speech understanding, with Gemini 2.5 Pro leading but open models lagging on fusion.

  5. Video-Holmes: Can MLLM Think Like Holmes for Complex Video Reasoning?

    cs.CV 2025-05 conditional novelty 7.0

    Video-Holmes benchmark shows top MLLMs achieve at most 45% accuracy on tasks needing integration of multiple clues from suspense films, unlike existing perception-focused tests.

  6. Video-R1: Reinforcing Video Reasoning in MLLMs

    cs.CV 2025-03 conditional novelty 7.0

    Video-R1 uses temporal-aware RL and mixed datasets to boost video reasoning in MLLMs, with a 7B model reaching 37.1% on VSI-Bench and surpassing GPT-4o.

  7. WorldSense: Evaluating Real-world Omnimodal Understanding for Multimodal LLMs

    cs.CV 2025-02 unverdicted novelty 7.0

    WorldSense provides the first benchmark requiring synergistic audio-video-text understanding on 1,662 real-world videos and 3,172 QA pairs, where the best current multimodal LLM reaches only 65.1% accuracy.

  8. Video-MMMU: Evaluating Knowledge Acquisition from Multi-Discipline Professional Videos

    cs.CV 2025-01 unverdicted novelty 7.0

    Video-MMMU benchmark shows large multimodal models exhibit steep performance drops on higher cognitive tasks when learning from professional videos and lag significantly behind humans in knowledge acquisition.

  9. POINTS-Long: Adaptive Dual-Mode Visual Reasoning in MLLMs

    cs.CV 2026-04 unverdicted novelty 6.0

    POINTS-Long is a dual-mode multimodal large language model that uses dynamic visual token scaling to retain 97.7-99.7% accuracy on long-form tasks with 1/40 to 1/10 of the tokens and supports streaming via a detachable KV-cache.

  10. From Perception to Planning: Evolving Ego-Centric Task-Oriented Spatiotemporal Reasoning via Curriculum Learning

    cs.AI 2026-04 unverdicted novelty 6.0

    EgoTSR applies a three-stage curriculum on a 46-million-sample dataset to build egocentric spatiotemporal reasoning, reaching 92.4% accuracy on long-horizon tasks and reducing chronological biases.

  11. Reinforce to Learn, Elect to Reason: A Dual Paradigm for Video Reasoning

    cs.CV 2026-04 unverdicted novelty 6.0

    RLER trains video-reasoning models with three task-driven RL rewards for evidence production and elects the best answer from a few candidates via evidence consistency scoring, yielding 6.3% average gains on eight benchmarks.

  12. Graph-to-Frame RAG: Visual-Space Knowledge Fusion for Training-Free and Auditable Video Reasoning

    cs.CV 2026-04 unverdicted novelty 6.0

    G2F-RAG converts retrieved knowledge subgraphs into a single visual reasoning frame appended to videos, enabling training-free and interpretable improvements for LMM-based video reasoning on knowledge-intensive tasks.

  13. STRIVE: Structured Spatiotemporal Exploration for Reinforcement Learning in Video Question Answering

    cs.CV 2026-04 unverdicted novelty 6.0

    STRIVE stabilizes RL for video QA by creating spatiotemporal video variants and using importance-aware sampling, yielding consistent gains over baselines on six benchmarks.

  14. Streaming Video Instruction Tuning

    cs.CV 2025-12 unverdicted novelty 6.0

    Streamo is a streaming video LLM trained end-to-end on the new Streamo-Instruct-465K dataset that unifies multiple real-time video tasks with claimed strong temporal reasoning and generalization.

  15. Detector-Empowered Video Large Language Model for Efficient Spatio-Temporal Grounding

    cs.CV 2025-12 conditional novelty 6.0

    DEViL offloads spatial grounding to a detector via a distilled reference-semantic token and temporal consistency regularization, reaching 43.1% m_vIoU at 14.33 FPS on HC-STVG.

  16. V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning

    cs.AI 2025-06 unverdicted novelty 6.0

    V-JEPA 2 pre-trained on massive unlabeled video achieves strong results on motion understanding and action anticipation, SOTA video QA at 8B scale, and enables zero-shot robotic planning on Franka arms using only 62 h...

  17. LLaVA-Video: Video Instruction Tuning With Synthetic Data

    cs.CV 2024-10 unverdicted novelty 6.0

    LLaVA-Video-178K is a new synthetic video instruction dataset that, when combined with existing data to train LLaVA-Video, produces strong results on video understanding benchmarks.

  18. TempR1: Improving Temporal Understanding of MLLMs via Temporal-Aware Multi-Task Reinforcement Learning

    cs.CV 2025-12 unverdicted novelty 5.0

    TempR1 applies temporal-aware multi-task RL using GRPO and three types of localization rewards to achieve SOTA temporal understanding in MLLMs with synergistic gains from joint optimization.

  19. Time-R1: Post-Training Large Vision Language Model for Temporal Video Grounding

    cs.CV 2025-03 unverdicted novelty 5.0

    Time-R1 applies RL with verifiable rewards to post-train LVLMs for temporal video grounding, reaching state-of-the-art results on multiple datasets using only 2.5K samples while also improving general video capabilities.

  20. EasyVideoR1: Easier RL for Video Understanding

    cs.CV 2026-04 unverdicted novelty 4.0

    EasyVideoR1 delivers an optimized RL pipeline for video understanding in large vision-language models, achieving 1.47x throughput gains and aligned results on 22 benchmarks.

  21. Seed1.5-VL Technical Report

    cs.CV 2025-05 unverdicted novelty 4.0

    Seed1.5-VL is a compact multimodal model that sets new records on dozens of vision-language benchmarks and outperforms prior systems on agent-style tasks.

  22. VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding

    cs.CV 2025-01 unverdicted novelty 4.0

    VideoLLaMA3 uses a vision-centric training paradigm and token-reduction design to reach competitive results on image and video benchmarks.

Reference graph

Works this paper leans on

135 extracted references · 135 canonical work pages · cited by 22 Pith papers · 26 internal anchors
