super hub Baseline reference

Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis

Chaoyou Fu, Lei Li, Renrui Zhang, Shuhuai Ren, Yongdong Luo, Yuhan Dai · 2024 · cs.CV · DOI 10.1609/icwsm.v17i1.22209 · arXiv 2405.21075

Baseline reference. 58% of citing Pith papers use this work as a benchmark or comparison.

118 Pith papers citing it

Baseline 58% of classified citations

open full Pith review browse 118 citing papers more from Chaoyou Fu arXiv PDF

abstract

In the quest for artificial general intelligence, Multi-modal Large Language Models (MLLMs) have emerged as a focal point in recent advancements. However, the predominant focus remains on developing their capabilities in static image understanding. The potential of MLLMs in processing sequential visual data is still insufficiently explored, highlighting the absence of a comprehensive, high-quality assessment of their performance. In this paper, we introduce Video-MME, the first-ever full-spectrum, Multi-Modal Evaluation benchmark of MLLMs in Video analysis. Our work distinguishes from existing benchmarks through four key features: 1) Diversity in video types, spanning 6 primary visual domains with 30 subfields to ensure broad scenario generalizability; 2) Duration in temporal dimension, encompassing both short-, medium-, and long-term videos, ranging from 11 seconds to 1 hour, for robust contextual dynamics; 3) Breadth in data modalities, integrating multi-modal inputs besides video frames, including subtitles and audios, to unveil the all-round capabilities of MLLMs; 4) Quality in annotations, utilizing rigorous manual labeling by expert annotators to facilitate precise and reliable model assessment. 900 videos with a total of 254 hours are manually selected and annotated by repeatedly viewing all the video content, resulting in 2,700 question-answer pairs. With Video-MME, we extensively evaluate various state-of-the-art MLLMs, including GPT-4 series and Gemini 1.5 Pro, as well as open-source image models like InternVL-Chat-V1.5 and video models like LLaVA-NeXT-Video. Our experiments reveal that Gemini 1.5 Pro is the best-performing commercial model, significantly outperforming the open-source models. Our dataset along with these findings underscores the need for further improvements in handling longer sequences and multi-modal data. Project Page: https://video-mme.github.io

hub tools

JSON dossier citing papers JSON publisher DOI arXiv source

citation-role summary

dataset 18 background 12 baseline 1

citation-polarity summary

use dataset 17 background 12 baseline 1 unclear 1

claims ledger

abstract In the quest for artificial general intelligence, Multi-modal Large Language Models (MLLMs) have emerged as a focal point in recent advancements. However, the predominant focus remains on developing their capabilities in static image understanding. The potential of MLLMs in processing sequential visual data is still insufficiently explored, highlighting the absence of a comprehensive, high-quality assessment of their performance. In this paper, we introduce Video-MME, the first-ever full-spectrum, Multi-Modal Evaluation benchmark of MLLMs in Video analysis. Our work distinguishes from existing

authors

Chaoyou Fu Lei Li Renrui Zhang Shuhuai Ren Yongdong Luo Yuhan Dai

co-cited works

representative citing papers

ReQuest: Rethinking-based Question-Aware Frame Selection for Long-Form Video QA

cs.CV · 2026-07-02 · unverdicted · novelty 7.0

ReQuest introduces an uncertainty-driven question-adaptive keyframe selector with rethinking routing and adaptive NMS that boosts long-form video QA accuracy on Video-MME, MLVU, and LongVideoBench without fine-tuning the base MLLM.

No Place to Hide: Benchmarking Video Hallucination with Background-Controlled Pairs

cs.CV · 2026-06-30 · unverdicted · novelty 7.0

Introduces VidPair-Halluc benchmark of 1K background-controlled adversarial video pairs and 11K QA pairs generated via PairFlow pipeline to evaluate hallucination in LVMs.

Zone of Proximal Policy Optimization: Teacher in Prompts, Not Gradients

cs.CL · 2026-06-16 · unverdicted · novelty 7.0

ZPPO improves distillation to small vision-language models by using binary and negative candidate prompts plus a replay buffer for hard questions, outperforming standard distillation and GRPO on a 31-benchmark suite with largest gains at the 0.8B scale.

Imagine Before You Predict: Interleaved Latent Visual Reasoning for Video Event Prediction

cs.CV · 2026-06-04 · unverdicted · novelty 7.0

Future-L1 interleaves latent visual spans with text in MLLM decoding, trained on a custom Future-L1-50K dataset via LA-DAPO RL, and reports SOTA gains on FutureBench (61.0 to 85.4) and TwiFF-Bench (2.44 to 3.04).

VideoKR: Towards Knowledge- and Reasoning-Intensive Video Understanding

cs.CV · 2026-06-03 · unverdicted · novelty 7.0

VideoKR supplies 315K knowledge-intensive video reasoning examples and a dedicated benchmark, with experiments indicating post-training gains on reasoning tasks that require both video content and external knowledge.

NextMotionQA: Benchmarking and Judging Human Motion Understanding with Vision-Language Models

cs.CV · 2026-06-03 · unverdicted · novelty 7.0

NextMotionQA benchmark reveals VLMs have critical gaps in fine-grained human motion understanding and align with experts on coarse judgment (κ=0.70) but not fine-grained (κ=0.10).

Pause and Think: A Dataset and Benchmark for Video-Grounded Assistive Action Suggestion

cs.CV · 2026-05-30 · unverdicted · novelty 7.0 · 2 refs

Introduces pause-and-think-T dataset and pause-and-think-B benchmark; fine-tunes 4B VLM to 58% accuracy matching 235B model while generalizing out-of-distribution.

SVI-Bench: A Dynamic Microworld for Strategic Video Intelligence

cs.CV · 2026-05-29 · unverdicted · novelty 7.0 · 2 refs

SVI-Bench provides 35K hours of sports video with 9 tasks across four cognitive levels, revealing models drop from ~74% on action QA to 5% on agentic evidence integration.

ST-SimDiff: Balancing Spatiotemporal Similarity and Difference for Efficient Video Understanding with MLLMs

cs.AI · 2026-05-21 · unverdicted · novelty 7.0

ST-SimDiff is a training-free method using a spatio-temporal graph and dual similarity-difference selection to compress video tokens for MLLMs while retaining static and dynamic content.

An Efficient Streaming Video Understanding Framework with Agentic Control

cs.CV · 2026-05-18 · unverdicted · novelty 7.0

R3-Streaming uses cascaded control with age-aware memory forgetting and TB-GRPO reinforcement learning to reach SOTA scores of 57.92 on OVO-Bench and 76.36 on StreamingBench with 95-96% fewer visual tokens.

Minerva-Ego: Spatiotemporal Hints for Egocentric Video Understanding

cs.CV · 2026-05-14 · unverdicted · novelty 7.0

Minerva-Ego is a new benchmark for egocentric visual reasoning with dense human-annotated traces and masks, showing that spatiotemporal hints substantially improve frontier model performance.

CoRDS: Coreset-based Representative and Diverse Selection for Streaming Video Understanding

cs.CV · 2026-05-14 · unverdicted · novelty 7.0

CoRDS selects a compact KV-cache subset via joint-space coreset coverage and log-det diversity to outperform token-wise heuristics on long-video VLM benchmarks.

EvoGround: Self-Evolving Video Agents for Video Temporal Grounding

cs.CV · 2026-05-13 · unverdicted · novelty 7.0

A proposer-solver agent pair achieves supervised-level video temporal grounding and fine-grained captioning from 2.5K unlabeled videos via self-reinforcing evolution.

Boosting Omni-Modal Language Models: Staged Post-Training with Visually Debiased Evaluation

cs.MM · 2026-05-12 · unverdicted · novelty 7.0 · 2 refs

Visual debiasing of omni-modal benchmarks combined with staged post-training lets a 3B model match or exceed a 30B model without a stronger teacher.

VEBench:Benchmarking Large Multimodal Models for Real-World Video Editing

cs.CV · 2026-05-05 · unverdicted · novelty 7.0 · 2 refs

VEBENCH is the first benchmark with 3.9K videos and 3,080 human-verified QA pairs that measures LMMs on video editing technique recognition and operation simulation, revealing a large gap to human performance.

FCMBench-Video: Benchmarking Document Video Intelligence

cs.CV · 2026-04-28 · unverdicted · novelty 7.0

FCMBench-Video is a new benchmark with 1,200 videos and 11k QA instances for evaluating Video-MLLMs on document video understanding across 28 document types.

Mosaic: Cross-Modal Clustering for Efficient Video Understanding

cs.PF · 2026-04-11 · unverdicted · novelty 7.0

Mosaic uses cross-modal clusters as the unit for KVCache organization in VLMs to achieve up to 1.38x speedup in streaming long-video understanding.

CrashSight: A Phase-Aware, Infrastructure-Centric Video Benchmark for Traffic Crash Scene Understanding and Reasoning

cs.CV · 2026-04-09 · unverdicted · novelty 7.0

CrashSight is a new infrastructure-focused benchmark showing that state-of-the-art vision-language models can describe crash scenes but fail at temporal and causal reasoning.

AdaSpark: Adaptive Sparsity for Efficient Long-Video Understanding

cs.CV · 2026-04-09 · unverdicted · novelty 7.0

AdaSpark delivers up to 57% FLOP reduction in Video-LLMs for long videos through adaptive cube- and token-level sparsity without apparent loss in performance on hour-scale benchmarks.

VSAS-Bench: Real-Time Evaluation of Visual Streaming Assistant Models

cs.CV · 2026-04-08 · unverdicted · novelty 7.0

VSAS-Bench offers temporally dense annotations and synchronous/asynchronous protocols to evaluate streaming VLMs on timeliness, consistency, accuracy, and latency trade-offs, showing that adapted conventional VLMs can outperform specialized streaming models.

SVAgent: Storyline-Guided Long Video Understanding via Cross-Modal Multi-Agent Collaboration

cs.CV · 2026-04-06 · unverdicted · novelty 7.0

SVAgent improves long video question answering by constructing storylines via multi-agent collaboration and aligning cross-modal predictions for more robust, human-like reasoning.

VideoThinker: Building Agentic VideoLLMs with LLM-Guided Tool Reasoning

cs.CV · 2026-01-22 · unverdicted · novelty 7.0

VideoThinker uses LLM-generated synthetic tool trajectories in caption space grounded to video frames to train agentic VideoLLMs that outperform baselines on long-video benchmarks.

ProMQA-Assembly: Multimodal Procedural QA Dataset on Assembly

cs.CL · 2025-09-03 · unverdicted · novelty 7.0

ProMQA-Assembly is a new multimodal procedural QA dataset with 646 pairs on assembly activities, built via LLM-generated candidates verified by humans plus 81 task graphs, and used to benchmark multimodal models.

SIV-Bench: A Video Benchmark for Social Interaction Understanding and Reasoning

cs.CV · 2025-06-05 · conditional · novelty 7.0

SIV-Bench is a new video benchmark with 2,792 clips and 5,455 QA pairs that evaluates MLLMs on social scene understanding, state reasoning, and dynamics prediction using social relation theory.

citing papers explorer

Showing 50 of 118 citing papers.

ReQuest: Rethinking-based Question-Aware Frame Selection for Long-Form Video QA cs.CV · 2026-07-02 · unverdicted · none · ref 8 · internal anchor
ReQuest introduces an uncertainty-driven question-adaptive keyframe selector with rethinking routing and adaptive NMS that boosts long-form video QA accuracy on Video-MME, MLVU, and LongVideoBench without fine-tuning the base MLLM.
No Place to Hide: Benchmarking Video Hallucination with Background-Controlled Pairs cs.CV · 2026-06-30 · unverdicted · none · ref 16 · internal anchor
Introduces VidPair-Halluc benchmark of 1K background-controlled adversarial video pairs and 11K QA pairs generated via PairFlow pipeline to evaluate hallucination in LVMs.
Zone of Proximal Policy Optimization: Teacher in Prompts, Not Gradients cs.CL · 2026-06-16 · unverdicted · none · ref 153 · internal anchor
ZPPO improves distillation to small vision-language models by using binary and negative candidate prompts plus a replay buffer for hard questions, outperforming standard distillation and GRPO on a 31-benchmark suite with largest gains at the 0.8B scale.
Imagine Before You Predict: Interleaved Latent Visual Reasoning for Video Event Prediction cs.CV · 2026-06-04 · unverdicted · none · ref 14 · internal anchor
Future-L1 interleaves latent visual spans with text in MLLM decoding, trained on a custom Future-L1-50K dataset via LA-DAPO RL, and reports SOTA gains on FutureBench (61.0 to 85.4) and TwiFF-Bench (2.44 to 3.04).
VideoKR: Towards Knowledge- and Reasoning-Intensive Video Understanding cs.CV · 2026-06-03 · unverdicted · none · ref 5 · internal anchor
VideoKR supplies 315K knowledge-intensive video reasoning examples and a dedicated benchmark, with experiments indicating post-training gains on reasoning tasks that require both video content and external knowledge.
NextMotionQA: Benchmarking and Judging Human Motion Understanding with Vision-Language Models cs.CV · 2026-06-03 · unverdicted · none · ref 46 · internal anchor
NextMotionQA benchmark reveals VLMs have critical gaps in fine-grained human motion understanding and align with experts on coarse judgment (κ=0.70) but not fine-grained (κ=0.10).
Pause and Think: A Dataset and Benchmark for Video-Grounded Assistive Action Suggestion cs.CV · 2026-05-30 · unverdicted · none · ref 5 · 2 links · internal anchor
Introduces pause-and-think-T dataset and pause-and-think-B benchmark; fine-tunes 4B VLM to 58% accuracy matching 235B model while generalizing out-of-distribution.
SVI-Bench: A Dynamic Microworld for Strategic Video Intelligence cs.CV · 2026-05-29 · unverdicted · none · ref 21 · 2 links · internal anchor
SVI-Bench provides 35K hours of sports video with 9 tasks across four cognitive levels, revealing models drop from ~74% on action QA to 5% on agentic evidence integration.
ST-SimDiff: Balancing Spatiotemporal Similarity and Difference for Efficient Video Understanding with MLLMs cs.AI · 2026-05-21 · unverdicted · none · ref 5 · internal anchor
ST-SimDiff is a training-free method using a spatio-temporal graph and dual similarity-difference selection to compress video tokens for MLLMs while retaining static and dynamic content.
An Efficient Streaming Video Understanding Framework with Agentic Control cs.CV · 2026-05-18 · unverdicted · none · ref 31 · internal anchor
R3-Streaming uses cascaded control with age-aware memory forgetting and TB-GRPO reinforcement learning to reach SOTA scores of 57.92 on OVO-Bench and 76.36 on StreamingBench with 95-96% fewer visual tokens.
Minerva-Ego: Spatiotemporal Hints for Egocentric Video Understanding cs.CV · 2026-05-14 · unverdicted · none · ref 15 · internal anchor
Minerva-Ego is a new benchmark for egocentric visual reasoning with dense human-annotated traces and masks, showing that spatiotemporal hints substantially improve frontier model performance.
CoRDS: Coreset-based Representative and Diverse Selection for Streaming Video Understanding cs.CV · 2026-05-14 · unverdicted · none · ref 18 · internal anchor
CoRDS selects a compact KV-cache subset via joint-space coreset coverage and log-det diversity to outperform token-wise heuristics on long-video VLM benchmarks.
EvoGround: Self-Evolving Video Agents for Video Temporal Grounding cs.CV · 2026-05-13 · unverdicted · none · ref 5 · internal anchor
A proposer-solver agent pair achieves supervised-level video temporal grounding and fine-grained captioning from 2.5K unlabeled videos via self-reinforcing evolution.
Boosting Omni-Modal Language Models: Staged Post-Training with Visually Debiased Evaluation cs.MM · 2026-05-12 · unverdicted · none · ref 52 · 2 links · internal anchor
Visual debiasing of omni-modal benchmarks combined with staged post-training lets a 3B model match or exceed a 30B model without a stronger teacher.
VEBench:Benchmarking Large Multimodal Models for Real-World Video Editing cs.CV · 2026-05-05 · unverdicted · none · ref 10 · 2 links · internal anchor
VEBENCH is the first benchmark with 3.9K videos and 3,080 human-verified QA pairs that measures LMMs on video editing technique recognition and operation simulation, revealing a large gap to human performance.
FCMBench-Video: Benchmarking Document Video Intelligence cs.CV · 2026-04-28 · unverdicted · none · ref 3 · internal anchor
FCMBench-Video is a new benchmark with 1,200 videos and 11k QA instances for evaluating Video-MLLMs on document video understanding across 28 document types.
Mosaic: Cross-Modal Clustering for Efficient Video Understanding cs.PF · 2026-04-11 · unverdicted · none · ref 25 · internal anchor
Mosaic uses cross-modal clusters as the unit for KVCache organization in VLMs to achieve up to 1.38x speedup in streaming long-video understanding.
CrashSight: A Phase-Aware, Infrastructure-Centric Video Benchmark for Traffic Crash Scene Understanding and Reasoning cs.CV · 2026-04-09 · unverdicted · none · ref 7 · internal anchor
CrashSight is a new infrastructure-focused benchmark showing that state-of-the-art vision-language models can describe crash scenes but fail at temporal and causal reasoning.
AdaSpark: Adaptive Sparsity for Efficient Long-Video Understanding cs.CV · 2026-04-09 · unverdicted · none · ref 11 · internal anchor
AdaSpark delivers up to 57% FLOP reduction in Video-LLMs for long videos through adaptive cube- and token-level sparsity without apparent loss in performance on hour-scale benchmarks.
VSAS-Bench: Real-Time Evaluation of Visual Streaming Assistant Models cs.CV · 2026-04-08 · unverdicted · none · ref 7 · internal anchor
VSAS-Bench offers temporally dense annotations and synchronous/asynchronous protocols to evaluate streaming VLMs on timeliness, consistency, accuracy, and latency trade-offs, showing that adapted conventional VLMs can outperform specialized streaming models.
SVAgent: Storyline-Guided Long Video Understanding via Cross-Modal Multi-Agent Collaboration cs.CV · 2026-04-06 · unverdicted · none · ref 9 · internal anchor
SVAgent improves long video question answering by constructing storylines via multi-agent collaboration and aligning cross-modal predictions for more robust, human-like reasoning.
VideoThinker: Building Agentic VideoLLMs with LLM-Guided Tool Reasoning cs.CV · 2026-01-22 · unverdicted · none · ref 8 · internal anchor
VideoThinker uses LLM-generated synthetic tool trajectories in caption space grounded to video frames to train agentic VideoLLMs that outperform baselines on long-video benchmarks.
ProMQA-Assembly: Multimodal Procedural QA Dataset on Assembly cs.CL · 2025-09-03 · unverdicted · none · ref 12 · internal anchor
ProMQA-Assembly is a new multimodal procedural QA dataset with 646 pairs on assembly activities, built via LLM-generated candidates verified by humans plus 81 task graphs, and used to benchmark multimodal models.
SIV-Bench: A Video Benchmark for Social Interaction Understanding and Reasoning cs.CV · 2025-06-05 · conditional · none · ref 13 · internal anchor
SIV-Bench is a new video benchmark with 2,792 clips and 5,455 QA pairs that evaluates MLLMs on social scene understanding, state reasoning, and dynamics prediction using social relation theory.
Video-Holmes: Can MLLM Think Like Holmes for Complex Video Reasoning? cs.CV · 2025-05-27 · conditional · none · ref 18 · internal anchor
Video-Holmes benchmark shows top MLLMs achieve at most 45% accuracy on tasks needing integration of multiple clues from suspense films, unlike existing perception-focused tests.
SpaceR: Reinforcing MLLMs in Video Spatial Reasoning cs.CV · 2025-04-02 · unverdicted · none · ref 10 · internal anchor
SpaceR uses a new verifiable dataset and map-imagination-augmented RLVR to reach SOTA spatial reasoning accuracy in MLLMs, exceeding GPT-4o on VSI-Bench.
Video-R1: Reinforcing Video Reasoning in MLLMs cs.CV · 2025-03-27 · conditional · none · ref 9 · internal anchor
Video-R1 uses temporal-aware RL and mixed datasets to boost video reasoning in MLLMs, with a 7B model reaching 37.1% on VSI-Bench and surpassing GPT-4o.
Unified Reward Model for Multimodal Understanding and Generation cs.CV · 2025-03-07 · unverdicted · none · ref 59 · internal anchor
UnifiedReward is the first unified reward model that jointly assesses multimodal understanding and generation to provide better preference signals for aligning vision models via DPO.
WorldSense: Evaluating Real-world Omnimodal Understanding for Multimodal LLMs cs.CV · 2025-02-06 · unverdicted · none · ref 18 · internal anchor
WorldSense provides the first benchmark requiring synergistic audio-video-text understanding on 1,662 real-world videos and 3,172 QA pairs, where the best current multimodal LLM reaches only 65.1% accuracy.
Video-MMMU: Evaluating Knowledge Acquisition from Multi-Discipline Professional Videos cs.CV · 2025-01-23 · unverdicted · none · ref 9 · internal anchor
Video-MMMU benchmark shows large multimodal models exhibit steep performance drops on higher cognitive tasks when learning from professional videos and lag significantly behind humans in knowledge acquisition.
VidHal: Benchmarking Temporal Hallucinations in Vision LLMs cs.CV · 2024-11-25 · unverdicted · none · ref 13 · internal anchor
VidHal is a new benchmark that evaluates VLLM temporal hallucinations through a caption ordering task on videos with varying hallucination levels.
LVBench: An Extreme Long Video Understanding Benchmark cs.CV · 2024-06-12 · accept · none · ref 9 · internal anchor
LVBench is a new benchmark for extreme long video understanding that evaluates multimodal large language models on hour-scale videos using tasks designed to probe extended memory and comprehension.
MLVU: Benchmarking Multi-task Long Video Understanding cs.CV · 2024-06-06 · conditional · none · ref 13 · internal anchor
MLVU is a new benchmark for long video understanding that uses extended videos across diverse genres and multi-task evaluations, revealing that current MLLMs struggle significantly and degrade sharply with longer durations.
MedStreamBench: A Time-Aware Benchmark for Streaming and Proactive Medical Video Understanding cs.CV · 2026-07-02 · unverdicted · none · ref 3 · internal anchor
MedStreamBench integrates 22 medical datasets into 5,419 QA instances across retrospective, present, future, and proactive temporal settings to evaluate streaming and proactive medical video understanding.
From Accuracy to Visual Dependence: Auditing and Filtering Modality Collapse in Traffic VideoQA cs.CV · 2026-06-29 · unverdicted · none · ref 8 · internal anchor
Audit of four VideoQA benchmarks reveals text-only shortcuts in VLMs; new diagnostics Blind Gap, Visual Gain, and Shortcut Score quantify and filter visual dependence.
HarmVideoBench: Benchmarking Harmful Video Understanding in Large Multimodal Models cs.CV · 2026-06-25 · unverdicted · none · ref 5 · 2 links · internal anchor
HarmVideoBench is a multi-layered benchmark for harmful video understanding in LVLMs with three hierarchical dimensions, and BCR is a method that raises average model performance from 61.7% to 84.4%.
Kamera: Unified Position-Invariant Multimodal KV Cache for Training-Free Reuse cs.DC · 2026-06-22 · unverdicted · none · ref 44 · internal anchor
Kamera stores a low-rank patch with each position-free KV chunk to restore cross-chunk conditioning lost in naive reuse, enabling cheap reordering, sliding windows, and recall across attention mechanisms.
HPP: Hierarchical Programmatic Probing for Long Video Understanding by Decoupling Perception and Reasoning cs.CV · 2026-06-19 · unverdicted · none · ref 212 · internal anchor
HPP decouples perception from reasoning in long-video VLMs by having an LLM run iterative programmatic probes on hierarchically segmented video, reporting gains on LongVideoBench, EgoSchema, VideoMME, and MLVU.
How Well Can Your Video Model Remember? Measuring Memory-Budget Trade-offs in Long Video Understanding cs.CV · 2026-06-17 · unverdicted · none · ref 13 · internal anchor
Fits a model where logit-accuracy scales linearly in log frame budget B with distance-dependent exponent α(D) that decays log-linearly with temporal distance D, based on 155k binary predictions across ten models.
Qwen-RobotWorld Technical Report: Unifying Embodied World Modeling through Language-Conditioned Video Generation cs.CV · 2026-06-15 · unverdicted · none · ref 18 · internal anchor
Qwen-RobotWorld is a language-conditioned video world model using Double-Stream MMDiT, an 8.6M-frame embodied corpus, and progressive curriculum training that ranks first on EWMBench and DreamGen Bench.
Harnessing Streaming Video in the Wild cs.CV · 2026-06-07 · unverdicted · none · ref 11 · internal anchor
Presents Streaming-Train-248K dataset, Streaming Harness system, and Streaming-Eval benchmark to enable VLMs for proactive, memory-equipped streaming video understanding.
VTI-CoT: Visual-Textual Interleaved Chain of Thought for Video Reasoning cs.CV · 2026-06-04 · unverdicted · none · ref 10 · internal anchor
VTI-CoT proposes a visual-textual interleaved chain-of-thought method for video reasoning, built via automated annotation and OCR compression, claiming SOTA performance and better training efficiency on same-scale models.
SagaQA: A Multi-hop Reasoning Benchmark for Long-form Narrative Understanding in TV Series cs.CL · 2026-06-02 · unverdicted · none · ref 1 · internal anchor
SagaQA is a new benchmark for long-range multi-hop reasoning in entire TV series, with evaluation showing hybrid planners outperform others on complex narrative tasks.
MOSS-Video-Preview: Toward Real-Time Video Understanding via Cross-Attention cs.CV · 2026-06-01 · unverdicted · none · ref 39 · internal anchor
MOSS-Video-Preview introduces a cross-attention architecture and synthesized real-time QA data to enable continuous perception, answer revision, and faster inference in video-language models compared to decoder-only designs.
PEEK: Picking Essential frames via Efficient Knowledge distillation cs.CV · 2026-05-29 · unverdicted · none · ref 9 · internal anchor
PEEK distills caption-conditioned frame relevance into a lightweight visual model, outperforming adaptive baselines on ActivityNet Captions and MSR-VTT especially at 1-2 frame budgets while adding only 5.2% overhead.
TeachObs: A Human-Validated Benchmark for Multimodal Teaching Observation and Model Evaluation cs.CL · 2026-05-29 · unverdicted · none · ref 5 · internal anchor
TeachObs provides human-validated segment- and lesson-level labels on 30 classroom videos and finds frontier VLMs inconsistent, frame-inflated, and over-generous versus expert raters.
Enhancing Visual Token Representations for Video Large Language Models via Training-Free Spatial-Temporal Pooling and Gridding cs.AI · 2026-05-21 · unverdicted · none · ref 2 · internal anchor
ST-GridPool improves video LLM performance via hierarchical temporal gridding and norm-based spatial pooling on visual tokens without training.
FineBench: Benchmarking and Enhancing Vision-Language Models for Fine-grained Human Activity Understanding cs.CV · 2026-05-19 · unverdicted · none · ref 8 · 3 links · internal anchor
FineBench is a large-scale human-centric VQA benchmark exposing weaknesses in open VLMs for fine-grained activity understanding, with FineAgent providing a practical enhancement method.
See What I Mean: Aligning Vision and Language Representations for Video Fine-grained Object Understanding cs.CV · 2026-05-18 · unverdicted · none · ref 20 · internal anchor
SWIM aligns cross-attention maps from object nouns to ground-truth masks during training on the new NL-Refer dataset to enable text-only fine-grained video object understanding in MLLMs.
TRACE: Evidence Grounding-Guided Multi-Video Event Understanding and Claim Generation cs.CV · 2026-05-16 · unverdicted · none · ref 2 · 2 links · internal anchor
TRACE improves multi-video event understanding by grounding evidence in structured timelines before visual reasoning, raising MiRAGE F1 from 0.705 to 0.811 on MAGMaR 2026.

Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis

hub tools

citation-role summary

citation-polarity summary

claims ledger

authors

co-cited works

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer