Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis
Baseline reference. 59% of citing Pith papers use this work as a benchmark or comparison.
abstract
In the quest for artificial general intelligence, Multi-modal Large Language Models (MLLMs) have emerged as a focal point in recent advancements. However, the predominant focus remains on developing their capabilities in static image understanding. The potential of MLLMs in processing sequential visual data is still insufficiently explored, highlighting the absence of a comprehensive, high-quality assessment of their performance. In this paper, we introduce Video-MME, the first-ever full-spectrum, Multi-Modal Evaluation benchmark of MLLMs in Video analysis. Our work distinguishes itself from existing benchmarks through four key features: 1) Diversity in video types, spanning 6 primary visual domains with 30 subfields to ensure broad scenario generalizability; 2) Duration in temporal dimension, encompassing short-, medium-, and long-term videos, ranging from 11 seconds to 1 hour, for robust contextual dynamics; 3) Breadth in data modalities, integrating multi-modal inputs besides video frames, including subtitles and audio, to unveil the all-round capabilities of MLLMs; 4) Quality in annotations, utilizing rigorous manual labeling by expert annotators to facilitate precise and reliable model assessment. 900 videos with a total of 254 hours are manually selected and annotated by repeatedly viewing all the video content, resulting in 2,700 question-answer pairs. With Video-MME, we extensively evaluate various state-of-the-art MLLMs, including GPT-4 series and Gemini 1.5 Pro, as well as open-source image models like InternVL-Chat-V1.5 and video models like LLaVA-NeXT-Video. Our experiments reveal that Gemini 1.5 Pro is the best-performing commercial model, significantly outperforming the open-source models. Our dataset along with these findings underscores the need for further improvements in handling longer sequences and multi-modal data. Project Page: https://video-mme.github.io
citing papers explorer
-
ST-SimDiff: Balancing Spatiotemporal Similarity and Difference for Efficient Video Understanding with MLLMs
ST-SimDiff is a training-free method using a spatio-temporal graph and dual similarity-difference selection to compress video tokens for MLLMs while retaining static and dynamic content.
-
FineBench: Benchmarking and Enhancing Vision-Language Models for Fine-grained Human Activity Understanding
FineBench is a new dense VQA benchmark for fine-grained human activity in long videos that exposes weaknesses in open VLMs and demonstrates gains from the proposed FineAgent modular framework.
-
Minerva-Ego: Spatiotemporal Hints for Egocentric Video Understanding
Minerva-Ego is a new benchmark for egocentric visual reasoning with dense human-annotated traces and masks, showing that spatiotemporal hints substantially improve frontier model performance.
-
CoRDS: Coreset-based Representative and Diverse Selection for Streaming Video Understanding
CoRDS selects a compact KV-cache subset via joint-space coreset coverage and log-det diversity to outperform token-wise heuristics on long-video VLM benchmarks.
-
EvoGround: Self-Evolving Video Agents for Video Temporal Grounding
A proposer-solver agent pair achieves supervised-level video temporal grounding and fine-grained captioning from 2.5K unlabeled videos via self-reinforcing evolution.
-
Boosting Omni-Modal Language Models: Staged Post-Training with Visually Debiased Evaluation
Visual debiasing of omni-modal benchmarks combined with staged post-training lets a 3B model match or exceed a 30B model without a stronger teacher.
-
VEBench: Benchmarking Large Multimodal Models for Real-World Video Editing
VEBench is the first benchmark with 3.9K videos and 3,080 human-verified QA pairs that measures LMMs on video editing technique recognition and operation simulation, revealing a large gap to human performance.
-
FCMBench-Video: Benchmarking Document Video Intelligence
FCMBench-Video is a new benchmark with 1,200 videos and 11k QA instances for evaluating Video-MLLMs on document video understanding across 28 document types.
-
Topology-Aware Layer Pruning for Large Vision-Language Models
A topology-aware pruning framework models layer representation evolution in LVLMs via simplicial complexes and zigzag persistent homology to enable adaptive removal of layers while outperforming existing methods on multimodal benchmarks.
-
Mosaic: Cross-Modal Clustering for Efficient Video Understanding
Mosaic uses cross-modal clusters as the unit for KVCache organization in VLMs to achieve up to 1.38x speedup in streaming long-video understanding.
-
CrashSight: A Phase-Aware, Infrastructure-Centric Video Benchmark for Traffic Crash Scene Understanding and Reasoning
CrashSight is a new infrastructure-focused benchmark showing that state-of-the-art vision-language models can describe crash scenes but fail at temporal and causal reasoning.
-
AdaSpark: Adaptive Sparsity for Efficient Long-Video Understanding
AdaSpark delivers up to 57% FLOP reduction in Video-LLMs for long videos through adaptive cube- and token-level sparsity without apparent loss in performance on hour-scale benchmarks.
-
VSAS-Bench: Real-Time Evaluation of Visual Streaming Assistant Models
VSAS-Bench offers temporally dense annotations and synchronous/asynchronous protocols to evaluate streaming VLMs on timeliness, consistency, accuracy, and latency trade-offs, showing that adapted conventional VLMs can outperform specialized streaming models.
-
SVAgent: Storyline-Guided Long Video Understanding via Cross-Modal Multi-Agent Collaboration
SVAgent improves long video question answering by constructing storylines via multi-agent collaboration and aligning cross-modal predictions for more robust, human-like reasoning.
-
Seeing the Scene Matters: Revealing Forgetting in Video Understanding Models with a Scene-Aware Long-Video Benchmark
SceneBench shows VLMs lose accuracy on scene-level questions in long videos due to forgetting, and Scene-RAG retrieval improves performance by 2.5%.
-
TrajTok: Learning Trajectory Tokens enables better Video Understanding
TrajTok learns adaptive trajectory tokens for videos through a unified end-to-end segmenter, improving understanding performance and efficiency over patch-based or external-pipeline tokenizers.
-
VideoThinker: Building Agentic VideoLLMs with LLM-Guided Tool Reasoning
VideoThinker uses LLM-generated synthetic tool trajectories in caption space grounded to video frames to train agentic VideoLLMs that outperform baselines on long-video benchmarks.
-
ProMQA-Assembly: Multimodal Procedural QA Dataset on Assembly
ProMQA-Assembly is a new multimodal procedural QA dataset with 646 pairs on assembly activities, built via LLM-generated candidates verified by humans plus 81 task graphs, and used to benchmark multimodal models.
-
SIV-Bench: A Video Benchmark for Social Interaction Understanding and Reasoning
SIV-Bench is a new video benchmark with 2,792 clips and 5,455 QA pairs that evaluates MLLMs on social scene understanding, state reasoning, and dynamics prediction using social relation theory.
-
Video-Holmes: Can MLLM Think Like Holmes for Complex Video Reasoning?
Video-Holmes benchmark shows top MLLMs achieve at most 45% accuracy on tasks needing integration of multiple clues from suspense films, unlike existing perception-focused tests.
-
SpaceR: Reinforcing MLLMs in Video Spatial Reasoning
SpaceR uses a new verifiable dataset and map-imagination-augmented RLVR to reach SOTA spatial reasoning accuracy in MLLMs, exceeding GPT-4o on VSI-Bench.
-
Video-R1: Reinforcing Video Reasoning in MLLMs
Video-R1 uses temporal-aware RL and mixed datasets to boost video reasoning in MLLMs, with a 7B model reaching 37.1% on VSI-Bench and surpassing GPT-4o.
-
Unified Reward Model for Multimodal Understanding and Generation
UnifiedReward is the first unified reward model that jointly assesses multimodal understanding and generation to provide better preference signals for aligning vision models via DPO.
-
WorldSense: Evaluating Real-world Omnimodal Understanding for Multimodal LLMs
WorldSense provides the first benchmark requiring synergistic audio-video-text understanding on 1,662 real-world videos and 3,172 QA pairs, where the best current multimodal LLM reaches only 65.1% accuracy.
-
Video-MMMU: Evaluating Knowledge Acquisition from Multi-Discipline Professional Videos
Video-MMMU benchmark shows large multimodal models exhibit steep performance drops on higher cognitive tasks when learning from professional videos and lag significantly behind humans in knowledge acquisition.
-
VidHal: Benchmarking Temporal Hallucinations in Vision LLMs
VidHal is a new benchmark that evaluates VLLM temporal hallucinations through a caption ordering task on videos with varying hallucination levels.
-
LVBench: An Extreme Long Video Understanding Benchmark
LVBench is a new benchmark for extreme long video understanding that evaluates multimodal large language models on hour-scale videos using tasks designed to probe extended memory and comprehension.
-
MLVU: Benchmarking Multi-task Long Video Understanding
MLVU is a new benchmark for long video understanding that uses extended videos across diverse genres and multi-task evaluations, revealing that current MLLMs struggle significantly and degrade sharply with longer durations.
-
Enhancing Visual Token Representations for Video Large Language Models via Training-Free Spatial-Temporal Pooling and Gridding
ST-GridPool improves video LLM performance via hierarchical temporal gridding and norm-based spatial pooling on visual tokens without training.
-
See What I Mean: Aligning Vision and Language Representations for Video Fine-grained Object Understanding
SWIM aligns cross-attention maps from object nouns to ground-truth masks during training on the new NL-Refer dataset to enable text-only fine-grained video object understanding in MLLMs.
-
An Efficient Streaming Video Understanding Framework with Agentic Control
R3-Streaming uses cascaded control, age-aware memory forgetting, and TB-GRPO reinforcement learning to reach SOTA scores on streaming video benchmarks while cutting visual token usage by 95-96%.
-
TRACE: Evidence Grounding-Guided Multi-Video Event Understanding and Claim Generation
TRACE builds structured text timelines from videos via OCR and detection, then applies text-only LLM evidence localization before LVLM claim generation, raising MiRAGE F1 from 0.705 to 0.811 on MAGMaR.
-
Personal Visual Context Learning in Large Multimodal Models
Introduces Personal VCL formalization and benchmark revealing LMM context gaps, plus an Agentic Context Bank baseline that boosts personalized visual reasoning.
-
MMTB: Evaluating Terminal Agents on Multimedia-File Tasks
MMTB is a new benchmark with 105 multimedia terminal tasks that shows how audio and video access changes agent performance and evidence use in executable workflows.
-
Video Active Perception: Effective Inference-Time Long-Form Video Understanding with Vision-Language Models
VAP is a training-free active-perception method that improves zero-shot long-form video QA performance and frame efficiency up to 5.6x in VLMs by selecting keyframes that differ from priors generated by a text-conditioned video model.
-
Video-ToC: Video Tree-of-Cue Reasoning
Video-ToC adds tree-guided cue localization, demand-based RL rewards, and automated datasets to video LLMs, reporting better results than prior methods on six understanding benchmarks plus a hallucination test.
-
One Token per Highly Selective Frame: Towards Extreme Compression for Long Video Understanding
XComp reaches extreme video compression (one token per selective frame) via learnable progressive token compression and question-conditioned frame selection, lifting LVBench accuracy from 42.9 percent to 46.2 percent after tuning on 2.5 percent of standard data.
-
Claw-Eval: Towards Trustworthy Evaluation of Autonomous Agents
Claw-Eval is a new trajectory-aware benchmark for LLM agents that records execution traces, audit logs, and environment snapshots to evaluate completion, safety, and robustness across 300 tasks, revealing that opaque grading misses 44% of safety issues.
-
HERMES: KV Cache as Hierarchical Memory for Efficient Streaming Video Understanding
HERMES organizes the KV cache into a hierarchical memory to enable real-time streaming video understanding in MLLMs, achieving 10x faster TTFT and up to 11.4% accuracy gains on streaming benchmarks with 68% fewer tokens.
-
Streaming Video Instruction Tuning
Streamo is a streaming video LLM trained end-to-end on the new Streamo-Instruct-465K dataset that unifies multiple real-time video tasks with claimed strong temporal reasoning and generalization.
-
StreamingVLM: Real-Time Understanding for Infinite Video Streams
StreamingVLM enables stable real-time understanding of infinite video streams at up to 8 FPS using a streaming KV cache and aligned SFT on overlapped chunks, with a 66.18% win rate over GPT-4o mini on a new two-hour video benchmark.
-
Perceive, Verify and Understand Long Video: Multi-Granular Perception and Active Verification via Interactive Agents
CogniGPT uses an interactive loop between a Multi-Granular Perception Agent and an Active Verification Agent to identify reliable clues in long videos with high accuracy and low frame usage.
-
Qwen3-Omni Technical Report
Qwen3-Omni is a unified multimodal model that achieves open-source SOTA on 32 of 36 audio and audio-visual benchmarks and overall SOTA on 22 without degrading performance on text, image, or video relative to single-modal Qwen counterparts.
-
InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency
InternVL3.5 advances open-source multimodal models with Cascade RL for +16% reasoning gains and ViR for 4x inference speedup, with the 241B model reaching SOTA among open-source MLLMs on multimodal, reasoning, and agentic tasks.
-
ReGATE: Learning Faster and Better with Fewer Tokens in MLLMs
ReGATE introduces a teacher-student adaptive token elision method that reduces training tokens to 38% while matching or exceeding baseline accuracy on multimodal benchmarks.
-
ChipSeek: Optimizing Verilog Generation via EDA-Integrated Reinforcement Learning
ChipSeek is a hierarchical-reward reinforcement learning framework with Curriculum-Guided Dynamic Policy Optimization that integrates EDA simulator feedback to improve LLM-generated RTL code on both functional correctness and PPA metrics.
-
GLM-4.5V and GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning
GLM-4.5V reaches state-of-the-art results on 42 multimodal benchmarks among open-source models of similar size by applying reinforcement learning with curriculum sampling to a strong vision foundation model.
-
One Trajectory, One Token: Grounded Video Tokenization via Panoptic Sub-object Trajectory
TrajViT tokenizes videos via panoptic sub-object trajectories, achieving 10x token reduction and outperforming ViT3D by 6% on retrieval and 5.2% on VideoQA tasks with faster training and inference.
-
VLM-3R: Vision-Language Models Augmented with Instruction-Aligned 3D Reconstruction
VLM-3R augments VLMs with implicit 3D tokens from monocular video via geometry encoding and 200K+ 3D reconstructive QA pairs, plus a new 138K-pair temporal benchmark, to support spatial and embodied reasoning.
-
LLaDA-V: Large Language Diffusion Models with Visual Instruction Tuning
LLaDA-V is a diffusion-based multimodal large language model that reaches competitive or state-of-the-art results on visual instruction tasks while using a non-autoregressive architecture.