pith. sign in

hub

Video-MME-v2: Towards the Next Stage in Benchmarks for Comprehensive Video Understanding

17 Pith papers cite this work. Polarity classification is still indexing.

17 Pith papers citing it
abstract

With the rapid advancement of video understanding, existing benchmarks are becoming increasingly saturated, exposing a critical discrepancy between inflated leaderboard scores and real-world model capabilities. To address this widening gap, we introduce Video-MME-v2, a comprehensive benchmark designed to rigorously evaluate the robustness and faithfulness of video understanding. To systematically evaluate model capabilities, we design a \textbf{progressive tri-level hierarchy} that incrementally increases the complexity of video comprehension, ranging from multi-point visual information aggregation, to temporal dynamics modeling, and ultimately to complex multimodal reasoning. Besides, in contrast to conventional per-question accuracy, we propose a \textbf{group-based non-linear evaluation} strategy that enforces both consistency across related queries and coherence in multi-step reasoning. It penalizes fragmented or guess-based correctness and assigns credit only to answers supported by valid reasoning. To guarantee data quality, Video-MME-v2 is constructed through a rigorously controlled human annotation pipeline, involving 12 annotators and 50 independent reviewers. Backed by \textbf{3,300 human-hours} and up to \textbf{5 rounds} of quality assurance, Video-MME-v2 aims to serve as one of the most authoritative video benchmarks. Extensive experiments reveal a substantial gap between current best model Gemini-3-Pro and human experts, and uncover a clear hierarchical bottleneck where errors in visual information aggregation and temporal modeling propagate to limit high-level reasoning. We further find that thinking-based reasoning is highly dependent on textual cues, improving performance with subtitles but sometimes degrading it in purely visual settings. By exposing these limitations, Video-MME-v2 establishes a demanding new testbed for the development of next-generation video MLLMs.

hub tools

citation-role summary

background 3 dataset 1

citation-polarity summary

fields

cs.CV 16 cs.CR 1

years

2026 17

clear filters

representative citing papers

Harnessing Streaming Video in the Wild

cs.CV · 2026-06-07 · unverdicted · novelty 6.0

Presents Streaming-Train-248K dataset, Streaming Harness system, and Streaming-Eval benchmark to enable VLMs for proactive, memory-equipped streaming video understanding.

VISD: Enhancing Video Reasoning via Structured Self-Distillation

cs.CV · 2026-05-07 · unverdicted · novelty 5.0 · 4 refs

VISD proposes structured self-distillation with a multi-dimensional judge model and direction-magnitude decoupling to improve token-level credit assignment and convergence speed in VideoLLM reasoning training.

Kwai Keye-VL-2.0 Technical Report

cs.CV · 2026-06-09 · unverdicted · novelty 4.0

Kwai Keye-VL-2.0-30B-A3B is a 30B MoE model with 3B active parameters using DSA adaptation and MOPD distillation that reports SOTA results on video understanding and agent benchmarks.

LLaVA-OneVision-2: Towards Next-Generation Perceptual Intelligence

cs.CV · 2026-05-25 · unverdicted · novelty 4.0

LLaVA-OV-2 uses codec-stream tokenization and a shared 3D RoPE to improve video, spatial, and tracking performance over Qwen3-VL-8B, while introducing the JumpScore benchmark for fine-grained motion localization.

EasyVideoR1: Easier RL for Video Understanding

cs.CV · 2026-04-18 · unverdicted · novelty 4.0

EasyVideoR1 delivers an optimized RL pipeline for video understanding in large vision-language models, achieving 1.47x throughput gains and aligned results on 22 benchmarks.

citing papers explorer

Showing 13 of 13 citing papers after filters.

  • TraceAV-Bench: Benchmarking Multi-Hop Trajectory Reasoning over Long Audio-Visual Videos cs.CV · 2026-05-08 · unverdicted · none · ref 72 · internal anchor

    TraceAV-Bench is the first benchmark for multi-hop trajectory reasoning over long audio-visual videos, showing top models reach only 51-68% accuracy with substantial room for improvement.

  • Imagine Before You Predict: Interleaved Latent Visual Reasoning for Video Event Prediction cs.CV · 2026-06-04 · unverdicted · none · ref 48 · internal anchor

    Future-L1 interleaves latent visual spans with text in MLLM decoding, trained on a custom Future-L1-50K dataset via LA-DAPO RL, and reports SOTA gains on FutureBench (61.0 to 85.4) and TwiFF-Bench (2.44 to 3.04).

  • Benchmarking Visual State Tracking in Multimodal Video Understanding cs.CV · 2026-06-02 · unverdicted · none · ref 18 · internal anchor

    VSTAT benchmark shows state-of-the-art MLLMs perform far below humans and only modestly above answer-prior baselines on visual state tracking, failing at visual perception despite correct textual reasoning.

  • VideoOdyssey: A Benchmark for Ultra-Long-Context and Omni-Modal Video Understanding cs.CV · 2026-05-21 · unverdicted · none · ref 9 · internal anchor

    VideoOdyssey is a new benchmark featuring ultra-long videos (avg. 109 min) across 11 domains with multi-level continuous certificates (avg. 16 min for visual, 12.8 min for audio-visual) to diagnose MLLM limitations in continuous reasoning and omni-modal perception.

  • GRASP: Learning to Ground Social Reasoning in Multi-Person Non-Verbal Interactions cs.CV · 2026-05-15 · unverdicted · none · ref 22 · internal anchor

    GRASP is a large-scale dataset and benchmark for social reasoning grounded in gaze and gesture events in multi-person videos, with Social Grounding Reward (SGR) proposed to improve model performance on GRASP-Bench.

  • GridProbe: Posterior-Probing for Adaptive Test-Time Compute in Long-Video VLMs cs.CV · 2026-05-11 · unverdicted · none · ref 15 · internal anchor

    GridProbe uses posterior probing on a KxK frame grid to adaptively select question-relevant frames, delivering up to 3.36x TFLOPs reduction with accuracy within 1.6 pp of the full-frame baseline on Video-MME-v2.

  • Harnessing Streaming Video in the Wild cs.CV · 2026-06-07 · unverdicted · none · ref 12 · internal anchor

    Presents Streaming-Train-248K dataset, Streaming Harness system, and Streaming-Eval benchmark to enable VLMs for proactive, memory-equipped streaming video understanding.

  • LiteFrame: Efficient Vision Encoders Unlock Frame Scaling in Video LLMs cs.CV · 2026-05-17 · unverdicted · none · ref 2 · 2 links · internal anchor

    LiteFrame is an efficient vision encoder backbone trained with Compressed Token Distillation and Language Model Adaptation to scale frame count in Video LLMs while cutting latency and raising accuracy.

  • VISD: Enhancing Video Reasoning via Structured Self-Distillation cs.CV · 2026-05-07 · unverdicted · none · ref 10 · 4 links · internal anchor

    VISD proposes structured self-distillation with a multi-dimensional judge model and direction-magnitude decoupling to improve token-level credit assignment and convergence speed in VideoLLM reasoning training.

  • InternVideo3: Agentify Foundation Models with Multimodal Contextual Reasoning cs.CV · 2026-06-10 · unverdicted · none · ref 287 · internal anchor

    InternVideo3 introduces Multimodal Contextual Reasoning and M^2LA attention to enable closed-loop evidence accumulation in long-video understanding and agentic tool use, reporting strong benchmark results.

  • Kwai Keye-VL-2.0 Technical Report cs.CV · 2026-06-09 · unverdicted · none · ref 8 · internal anchor

    Kwai Keye-VL-2.0-30B-A3B is a 30B MoE model with 3B active parameters using DSA adaptation and MOPD distillation that reports SOTA results on video understanding and agent benchmarks.

  • LLaVA-OneVision-2: Towards Next-Generation Perceptual Intelligence cs.CV · 2026-05-25 · unverdicted · none · ref 13 · internal anchor

    LLaVA-OV-2 uses codec-stream tokenization and a shared 3D RoPE to improve video, spatial, and tracking performance over Qwen3-VL-8B, while introducing the JumpScore benchmark for fine-grained motion localization.

  • EasyVideoR1: Easier RL for Video Understanding cs.CV · 2026-04-18 · unverdicted · none · ref 12 · internal anchor

    EasyVideoR1 delivers an optimized RL pipeline for video understanding in large vision-language models, achieving 1.47x throughput gains and aligned results on 22 benchmarks.