TemporalBench: Benchmarking Fine-grained Temporal Understanding for Multimodal Video Models

Bocheng Zou; Fangrui Zhu; Feng Yao; Jaden Park; Jianfeng Gao; Jianrui Zhang; Jianwei Yang; Jing Gu; Kai Zhang; Mu Cai

arxiv: 2410.10818 · v2 · pith:DEIB5PUNnew · submitted 2024-10-14 · 💻 cs.CV · cs.AI· cs.CL· cs.LG

TemporalBench: Benchmarking Fine-grained Temporal Understanding for Multimodal Video Models

Mu Cai , Reuben Tan , Jianrui Zhang , Bocheng Zou , Kai Zhang , Feng Yao , Fangrui Zhu , Jing Gu

show 7 more authors

Yiwu Zhong Yuzhang Shang Yao Dou Jaden Park Jianfeng Gao Yong Jae Lee Jianwei Yang

This is my paper

classification 💻 cs.CV cs.AIcs.CLcs.LG

keywords temporalvideomodelsunderstandingtemporalbenchfine-grainedevaluatingmultimodal

0 comments

read the original abstract

Understanding fine-grained temporal dynamics is crucial for multimodal video comprehension and generation. Due to the lack of fine-grained temporal annotations, existing video benchmarks mostly resemble static image benchmarks and are incompetent at evaluating models for temporal understanding. In this paper, we introduce TemporalBench, a new benchmark dedicated to evaluating fine-grained temporal understanding in videos. TemporalBench consists of ~10K video question-answer pairs, derived from ~2K high-quality human annotations detailing the temporal dynamics in video clips. As a result, our benchmark provides a unique testbed for evaluating various temporal understanding and reasoning abilities such as action frequency, motion magnitude, event order, etc. Moreover, it enables evaluations on various tasks like both video question answering and captioning, both short and long video understanding, as well as different models such as multimodal video embedding models and text generation models. Results show that state-of-the-art models like GPT-4o achieve only 38.5% question answering accuracy on TemporalBench, demonstrating a significant gap (~30%) between humans and AI in temporal understanding. Furthermore, we notice a critical pitfall for multi-choice QA where LLMs can detect the subtle changes in negative captions and find a centralized description as a cue for its prediction, where we propose Multiple Binary Accuracy (MBA) to correct such bias. We hope that TemporalBench can foster research on improving models' temporal reasoning capabilities. Both dataset and evaluation code will be made available.

This paper has not been read by Pith yet.

discussion (0)

Forward citations

Cited by 28 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

When Text Hijacks Vision: Benchmarking and Mitigating Text Overlay-Induced Hallucination in Vision Language Models
cs.CV 2026-04 unverdicted novelty 8.0

VLMs hallucinate by prioritizing contradictory on-screen text over visual content, addressed via the VisualTextTrap benchmark with 6,057 human-validated samples and the VTHM-MoE dual-encoder framework using dimension-...
Animation2Code: Evaluating Temporal Visual Reasoning in Video-to-Code Generation
cs.CV 2026-06 unverdicted novelty 7.0

Animation2Code benchmark with 1,069 videos tests VLMs on generating animation code, showing persistent failures in temporal consistency despite good visual matches.
MAOAM: Unified Object and Material Selection with Vision-Language Models
cs.CV 2026-06 unverdicted novelty 7.0

MAOAM unifies object and material selection via a VLM with segmentation head, supporting text and click interactions through multi-task training on VLM-generated material data.
Benchmarking Visual State Tracking in Multimodal Video Understanding
cs.CV 2026-06 unverdicted novelty 7.0

VSTAT benchmark shows state-of-the-art MLLMs perform far below humans and only modestly above answer-prior baselines on visual state tracking, failing at visual perception despite correct textual reasoning.
YoCausal: How Far is Video Generation from World Model? A Causality Perspective
cs.CV 2026-05 unverdicted novelty 7.0

YoCausal benchmark shows video diffusion models detect the arrow of time but lack genuine causal understanding relative to humans.
FineBench: Benchmarking and Enhancing Vision-Language Models for Fine-grained Human Activity Understanding
cs.CV 2026-05 unverdicted novelty 7.0

FineBench is a new dense VQA benchmark for fine-grained human activity in long videos that exposes weaknesses in open VLMs and demonstrates gains from the proposed FineAgent modular framework.
FineBench: Benchmarking and Enhancing Vision-Language Models for Fine-grained Human Activity Understanding
cs.CV 2026-05 unverdicted novelty 7.0

FineBench is a new dense VQA benchmark for fine-grained human activity understanding in long videos, revealing weaknesses in open VLMs and showing that FineAgent improves them via localization and description modules.
Minerva-Ego: Spatiotemporal Hints for Egocentric Video Understanding
cs.CV 2026-05 unverdicted novelty 7.0

Minerva-Ego is a new benchmark for egocentric visual reasoning with dense human-annotated traces and masks, showing that spatiotemporal hints substantially improve frontier model performance.
EvoGround: Self-Evolving Video Agents for Video Temporal Grounding
cs.CV 2026-05 unverdicted novelty 7.0

A proposer-solver agent pair achieves supervised-level video temporal grounding and fine-grained captioning from 2.5K unlabeled videos via self-reinforcing evolution.
TOC-Bench: A Temporal Object Consistency Benchmark for Video Large Language Models
cs.CV 2026-05 unverdicted novelty 7.0

TOC-Bench is an object-track-grounded benchmark that filters for temporally dependent questions and shows Video-LLMs have major weaknesses in event counting, ordering, identity reasoning, and hallucination detection.
TOC-Bench: A Temporal Object Consistency Benchmark for Video Large Language Models
cs.CV 2026-05 conditional novelty 7.0

TOC-Bench is a new diagnostic benchmark that reveals major weaknesses in temporal object consistency for Video-LLMs, including event counting, ordering, identity reasoning, and hallucination avoidance.
VideoNet: A Large-Scale Dataset for Domain-Specific Action Recognition
cs.CV 2026-05 unverdicted novelty 7.0

VideoNet is a new large-scale benchmark and training dataset for domain-specific action recognition that exposes limitations in VLMs and enables smaller fine-tuned models to surpass larger open-weight ones.
GTASA: Ground Truth Annotations for Spatiotemporal Analysis, Evaluation and Training of Video Models
cs.CV 2026-04 unverdicted novelty 7.0

GTASA supplies annotated multi-actor videos with exact 3D spatial and temporal ground truth that outperforms neural video generators in physical and semantic validity while enabling new probes of video encoders.
See, Hear, and Understand: Benchmarking Audiovisual Human Speech Understanding in Multimodal Large Language Models
cs.CV 2025-12 unverdicted novelty 7.0

AV-SpeakerBench is a new speaker-centered benchmark showing that top multimodal models still struggle with fine-grained audiovisual speech understanding, with Gemini 2.5 Pro leading but open models lagging on fusion.
Flattery in Motion: Benchmarking and Analyzing Sycophancy in Video-LLMs
cs.CL 2025-06 unverdicted novelty 7.0

VISE is the first benchmark for sycophancy in Video-LLMs, with two training-free mitigation strategies based on key-frame selection and internal representation steering.
Video-MMMU: Evaluating Knowledge Acquisition from Multi-Discipline Professional Videos
cs.CV 2025-01 unverdicted novelty 7.0

Video-MMMU benchmark shows large multimodal models exhibit steep performance drops on higher cognitive tasks when learning from professional videos and lag significantly behind humans in knowledge acquisition.
APT: Atomic Physical Transitions for Causal Video-Language Understanding
cs.CV 2026-06 unverdicted novelty 6.0

Introduces APT chains as ordered causal transition sequences and APT-Tune to improve VLM transition detection while preserving event-level performance.
TeachObs: A Human-Validated Benchmark for Multimodal Teaching Observation and Model Evaluation
cs.CL 2026-05 unverdicted novelty 6.0

TeachObs is a new human-validated benchmark dataset and evaluation protocol for multimodal AI on classroom teaching observation, showing no model dominates across tracks and that models over-rate procedurally clear lessons.
IPIBench: Evaluating Interactive Proactive Intelligence of MLLMs under Continuous Streams
cs.CV 2026-05 unverdicted novelty 6.0

IPIBench evaluates MLLMs on interactive proactive intelligence in streaming videos, identifies unstable triggering and poor coordination, and proposes the training-free IPI-Agent framework to improve performance acros...
The TIME Machine: On The Power of Motion for Efficient Perception
cs.CV 2026-05 unverdicted novelty 6.0

TIME is a motion-based embedding from point tracks, trained only on synthetic data via masked autoencoding, that matches state-of-the-art video model performance with up to 10,000x less training data.
Learning Spatiotemporal Sensitivity in Video LLMs via Counterfactual Reinforcement Learning
cs.CV 2026-05 unverdicted novelty 6.0

CRPO applies counterfactual videos and a cross-branch relation reward in RL post-training to reduce shortcut reliance in Video LLMs, with gains shown on the new DyBench paired benchmark.
FineBench: Benchmarking and Enhancing Vision-Language Models for Fine-grained Human Activity Understanding
cs.CV 2026-05 unverdicted novelty 6.0

FineBench is a large-scale human-centric VQA benchmark exposing weaknesses in open VLMs for fine-grained activity understanding, with FineAgent providing a practical enhancement method.
POINTS-Long: Adaptive Dual-Mode Visual Reasoning in MLLMs
cs.CV 2026-04 unverdicted novelty 6.0

POINTS-Long is a dual-mode multimodal large language model that uses dynamic visual token scaling to retain 97.7-99.7% accuracy on long-form tasks with 1/40 to 1/10th the tokens and supports streaming via detachable KV-cache.
LongVT: Incentivizing "Thinking with Long Videos" via Native Tool Calling
cs.CV 2025-11 unverdicted novelty 6.0

LongVT adds native video-cropping tool calling to LMMs for interleaved multimodal chain-of-tool-thought reasoning on long videos and releases VideoSIAH data for training and evaluation.
V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning
cs.AI 2025-06 unverdicted novelty 6.0

V-JEPA 2 pre-trained on massive unlabeled video achieves strong results on motion understanding and action anticipation, SOTA video QA at 8B scale, and enables zero-shot robotic planning on Franka arms using only 62 h...
MUSEG: Reinforcing Video Temporal Understanding via Timestamp-Aware Multi-Segment Grounding
cs.CV 2025-05 unverdicted novelty 6.0

MUSEG applies timestamp-aware multi-segment grounding with a phased-reward RL recipe to boost temporal grounding and time-sensitive video QA performance in MLLMs.
LLaVA-Video: Video Instruction Tuning With Synthetic Data
cs.CV 2024-10 unverdicted novelty 6.0

LLaVA-Video-178K is a new synthetic video instruction dataset that, when combined with existing data to train LLaVA-Video, produces strong results on video understanding benchmarks.
Seed1.5-VL Technical Report
cs.CV 2025-05 unverdicted novelty 4.0

Seed1.5-VL is a compact multimodal model that sets new records on dozens of vision-language benchmarks and outperforms prior systems on agent-style tasks.