hub Mixed citations

VideoChat: Chat-Centric Video Understanding

KunChang Li, Yinan He, Yi Wang, Yizhuo Li, Wenhai Wang, Ping Luo · 2023 · cs.CV · arXiv 2305.06355

Mixed citation behavior. Most common role is background (64%).

74 Pith papers citing it

Background 64% of classified citations

open full Pith review browse 74 citing papers arXiv PDF

abstract

In this paper, we initiate an attempt of developing an end-to-end chat-centric video understanding system, coined as VideoChat. It integrates video foundation models and large language models via a learnable neural interface, excelling in spatiotemporal reasoning, event localization, and causal relationship inference. To instructively tune this system, we build a video-centric instruction dataset, composed of thousands of videos associated with detailed descriptions and conversations. This dataset emphasizes spatiotemporal reasoning and captures causal relationships, providing a valuable asset for training our chat-centric video understanding system. Preliminary qualitative experiments demonstrate the potential of our system across a broad spectrum of video applications, which could serve as a simple prototype system for future research on chat-centric video understanding. Access our code and data at https://github.com/OpenGVLab/Ask-Anything

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 16 baseline 7 dataset 1 method 1

citation-polarity summary

background 16 baseline 7 use dataset 1 use method 1

claims ledger

abstract In this paper, we initiate an attempt of developing an end-to-end chat-centric video understanding system, coined as VideoChat. It integrates video foundation models and large language models via a learnable neural interface, excelling in spatiotemporal reasoning, event localization, and causal relationship inference. To instructively tune this system, we build a video-centric instruction dataset, composed of thousands of videos associated with detailed descriptions and conversations. This dataset emphasizes spatiotemporal reasoning and captures causal relationships, providing a valuable asset
baseline 2 Related Work Multimodal Large Language Models. With the impressive success of Large language models (LLM) [1, 5, 4], recent studies work on generative Multimodal Large Language Models (MLLMs) [6, 7, 8, 9, 10, 11, 12, 13, 14, 18, 19, 20, 21] to improve multimodal comprehension and generation through utilizing the strong generality of LLMs. Some work [ 15, 16, 17] further considers video inputs and leverage the vast capabilities of LLMs for video understanding tasks. In SEED-Bench, we provide a
baseline Video → Text Text → Videomethod #F R@1 R@5 R@10 R@1 R@5 R@10 avg. OpenAI CLIP-L [117] 1 27.8 49.4 58.0 29.0 50.5 59.2 45.7 InternVL-C (ours) 1 35.3 56.6 66.6 37.5 60.9 70.9 54.6 InternVL-G (ours) 1 36.6 58.3 67.7 39.1 61.7 70.7 55.7 OpenAI CLIP-L [117] 8 26.6 50.8 61.8 30.7 54.4 64.0 48.1 Florence [171] 8 - - - 37.6 63.8 72.6 - InternVideo† [151] 8 39.6 - - 40.7 - - - UMT-L† [83] 8 38.6 59.8 69.6 42.6 64.4 73.1 58.0 LanguageBind† [186] 8 40.9 66.4 75.7 44.8 70.0 78.7 62.8 InternVL-C (ours) 8 40.
background MLLMs-based video translation. By structuring our survey in this role-oriented manner (see Fig. 1), we aim to provide conceptual clarity and facilitate comparative analysis. The arXiv:2604.11283v1 [cs.CV] 13 Apr 2026 2 Taxonomy The SemanticReasoner Video-Language Alignment MiniGPT4-Video [1], FrozenBiLM [2], Video-ChatGPT [3], Video-LLaMA [4], VideoChat [5], LLaMA-VID [6], Valley [7], Vista-LLaMA [8], IG-VLM [9], VideoChat2 [10], VaQuitA [11], Vamos [12], COSMO [13], IVA [14], MMICT [15], LXMERT
baseline 4 / 55.0 61.0 1.39 56.6 53.0 54.1 Qwen2-VL-2B [138] 55.6 / 60.4 63.2 - - - - Qwen2.5-VL-3B [5] 61.5 / 67.6 67.0 1.63 68.2 43.3 60.3 InternVL3-2B [187] 58.9 / 61.4 70.4 1.42 64.2 55.4 59.6 InternVL3.5-2B 58.4 / 61.9 65.9 1.56 64.4 57.4 60.0 MiniCPM-V-4-4B [164] 61.2 / 65.8 58.7 - - - - InternVL3.5-4B 65.4 / 68.6 71.2 1.59 70.4 60.8 64.9 VideoChat2-HD [62] 45.3 / 55.7 62.3 1.22 47.9 - - LLaV A-OneVision-7B [58] 58.2 / - 56.7 - - - - MiniCPM-V-2.6 [164] 60.9 / 63.6 - 1.70 - 54.9 - Qwen2-VL-7B [138]
dataset Img-Diff (en) [101], Birds-to-Words (en) [100], Spot-the-Diff (en) [100], MultiVQA (en) [100], NLVR2 (en) [216],General QA ContrastiveCaption (en) [100], DreamSim (en) [100], InternVL-SA-1B-Caption (en & zh) [36] Document MP-DocVQA (en) [233], MP-Docmatix (en) [121] Type: Video Datasets Vript (en & zh) [269], OpenVid (en) [190], Mementos (en) [254], ShareGPT4o-Video (en & zh) [35],Captioning ShareGPT4Video (en & zh) [30], VideoGPT+ (en) [174] VideoChat2-IT (en & zh) [130, 131], EgoTaskQA (en) [9
background improved long video understanding accuracy. 2 Related Work Vision-Language Models for Long Sequence Understanding.Early Vision-Language Models (VLMs), such as GPT-4V and Gemini-1.5 [49, 58], showcased powerful multimodal reasoning by integrating visual encoders with large language models. Open-source efforts like Llama-Vid [36], IDEFICS [24], VideoChat [34], Video-LLaMA [12], and others [2, 32, 35, 38, 44, 61, 62] have further advanced capabilities, often matching or exceeding proprietary system

co-cited works

representative citing papers

CaST-Bench: Benchmarking Causal Chain-Grounded Spatio-Temporal Reasoning for Video Question Answering

cs.CV · 2026-05-22 · unverdicted · novelty 7.0

Introduces CaST-Bench, a dataset of 2,066 causal questions on 1,015 videos with annotated causal chains and metrics to evaluate VLMs on spatio-temporal causal reasoning.

AffectVerse: Emotional World Models for Multimodal Affective Computing

cs.CV · 2026-05-19 · unverdicted · novelty 7.0

AffectVerse improves multimodal emotion recognition by at least 2.57% on nine benchmarks through an Emotion World Module that performs short-horizon latent affective prediction via cross-modal temporal imagination and belief aggregation.

Omni-DuplexEval: Evaluating Real-time Duplex Omni-modal Interaction

cs.CV · 2026-05-17 · conditional · novelty 7.0

Omni-DuplexEval creates a new benchmark and LLM-as-a-Judge framework for real-time duplex omni-modal interaction, revealing that current models score below 40% overall and struggle especially with proactive responses.

EvoGround: Self-Evolving Video Agents for Video Temporal Grounding

cs.CV · 2026-05-13 · unverdicted · novelty 7.0

A proposer-solver agent pair achieves supervised-level video temporal grounding and fine-grained captioning from 2.5K unlabeled videos via self-reinforcing evolution.

OZ-TAL: Online Zero-Shot Temporal Action Localization

cs.CV · 2026-05-11 · unverdicted · novelty 7.0

Defines OZ-TAL task and presents a training-free VLM-based method that outperforms prior approaches for online and offline zero-shot temporal action localization on THUMOS14 and ActivityNet-1.3.

IMPACT-CYCLE: A Contract-Based Multi-Agent System for Claim-Level Supervisory Correction of Long-Video Semantic Memory

cs.CV · 2026-04-22 · unverdicted · novelty 7.0

A contract-based multi-agent system maintains a claim-level semantic memory for long videos, enabling targeted corrections that raise VQA accuracy from 0.71 to 0.79 and cut human arbitration cost by 4.8x on VidOR.

OASIS: On-Demand Hierarchical Event Memory for Streaming Video Reasoning

cs.CV · 2026-04-18 · unverdicted · novelty 7.0

OASIS organizes streaming video into hierarchical events and retrieves memory on-demand via intent-driven refinement to improve long-horizon accuracy and compositional reasoning with bounded token costs.

SVAgent: Storyline-Guided Long Video Understanding via Cross-Modal Multi-Agent Collaboration

cs.CV · 2026-04-06 · unverdicted · novelty 7.0

SVAgent improves long video question answering by constructing storylines via multi-agent collaboration and aligning cross-modal predictions for more robust, human-like reasoning.

VideoThinker: Building Agentic VideoLLMs with LLM-Guided Tool Reasoning

cs.CV · 2026-01-22 · unverdicted · novelty 7.0

VideoThinker uses LLM-generated synthetic tool trajectories in caption space grounded to video frames to train agentic VideoLLMs that outperform baselines on long-video benchmarks.

Can Multi-Modal LLMs Provide Live Step-by-Step Task Guidance?

cs.CV · 2025-11-27 · unverdicted · novelty 7.0

Introduces the first dedicated benchmark for live multi-modal LLM task guidance with mistake detection and a streaming baseline model.

EMCompress: Video-LLMs with Endomorphic Multimodal Compression

cs.CV · 2025-08-27 · unverdicted · novelty 7.0

EMCompress introduces EMC as an endomorphic sufficient-statistic transformation for VideoQA that preserves answer invariance, releases a dedicated benchmark, and reports a ReSimplifyIt baseline with 0.40 F-1 gains plus efficiency improvements.

Video-Holmes: Can MLLM Think Like Holmes for Complex Video Reasoning?

cs.CV · 2025-05-27 · conditional · novelty 7.0

Video-Holmes benchmark shows top MLLMs achieve at most 45% accuracy on tasks needing integration of multiple clues from suspense films, unlike existing perception-focused tests.

HumanVBench: Probing Human-Centric Video Understanding in MLLMs with Automatically Synthesized Benchmarks

cs.CV · 2024-12-23 · unverdicted · novelty 7.0

HumanVBench provides a 16-task benchmark for human-centric video understanding in MLLMs, created through automated annotation and distractor synthesis pipelines, and shows top models lag human performance on emotion perception and cross-modal alignment.

VidHal: Benchmarking Temporal Hallucinations in Vision LLMs

cs.CV · 2024-11-25 · unverdicted · novelty 7.0

VidHal is a new benchmark that evaluates VLLM temporal hallucinations through a caption ordering task on videos with varying hallucination levels.

LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models

cs.CV · 2024-07-10 · unverdicted · novelty 7.0

LLaVA-NeXT-Interleave unifies multi-image, video, and 3D capabilities in large multimodal models via a new 1.18M-sample interleaved dataset and benchmark, achieving leading results across those tasks while preserving single-image performance.

MLVU: Benchmarking Multi-task Long Video Understanding

cs.CV · 2024-06-06 · conditional · novelty 7.0

MLVU is a new benchmark for long video understanding that uses extended videos across diverse genres and multi-task evaluations, revealing that current MLLMs struggle significantly and degrade sharply with longer durations.

SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension

cs.CL · 2023-07-30 · unverdicted · novelty 7.0

SEED-Bench is a new benchmark of 19K multiple-choice questions for evaluating generative comprehension in multimodal LLMs across 12 image and video dimensions.

Enhancing Visual Token Representations for Video Large Language Models via Training-Free Spatial-Temporal Pooling and Gridding

cs.AI · 2026-05-21 · unverdicted · novelty 6.0

ST-GridPool improves video LLM performance via hierarchical temporal gridding and norm-based spatial pooling on visual tokens without training.

DynaTok: Temporally Adaptive and Positional Bias-Aware Token Compression for Video-LLMs

cs.CV · 2026-05-19 · unverdicted · novelty 6.0

DynaTok introduces temporally adaptive budget allocation with EMA memory and spatial selection with memory to compress video tokens, retaining over 95% accuracy at 90% reduction on VideoQA benchmarks.

See What I Mean: Aligning Vision and Language Representations for Video Fine-grained Object Understanding

cs.CV · 2026-05-18 · unverdicted · novelty 6.0

SWIM aligns cross-attention maps from object nouns to ground-truth masks during training on the new NL-Refer dataset to enable text-only fine-grained video object understanding in MLLMs.

OProver: A Unified Framework for Agentic Formal Theorem Proving

cs.CL · 2026-05-17 · unverdicted · novelty 6.0

OProver-32B achieves top Pass@32 scores on MiniF2F, ProverBench, and PutnamBench by combining continued pretraining with iterative agentic proving, retrieval, SFT on repairs, and RL on unresolved cases using a 6.86M-proof dataset.

Beyond Perceptual Shortcuts: Causal-Inspired Debiasing Optimization for Generalizable Video Reasoning in Lightweight MLLMs

cs.CV · 2026-05-02 · unverdicted · novelty 6.0

VideoThinker improves lightweight MLLM video reasoning by creating a bias model to capture shortcuts and applying causal debiasing policy optimization to push away from them, achieving SOTA efficiency with minimal data.

Seeing Fast and Slow: Learning the Flow of Time in Videos

cs.CV · 2026-04-23 · unverdicted · novelty 6.0

Self-supervised models learn to perceive and manipulate the flow of time in videos, supporting speed detection, large-scale slow-motion data curation, and temporally controllable video synthesis.

One Token per Highly Selective Frame: Towards Extreme Compression for Long Video Understanding

cs.CV · 2026-04-15 · unverdicted · novelty 6.0

XComp reaches extreme video compression (one token per selective frame) via learnable progressive token compression and question-conditioned frame selection, lifting LVBench accuracy from 42.9 percent to 46.2 percent after tuning on 2.5 percent of standard data.

citing papers explorer

Showing 50 of 74 citing papers.

CaST-Bench: Benchmarking Causal Chain-Grounded Spatio-Temporal Reasoning for Video Question Answering cs.CV · 2026-05-22 · unverdicted · none · ref 21 · internal anchor
Introduces CaST-Bench, a dataset of 2,066 causal questions on 1,015 videos with annotated causal chains and metrics to evaluate VLMs on spatio-temporal causal reasoning.
AffectVerse: Emotional World Models for Multimodal Affective Computing cs.CV · 2026-05-19 · unverdicted · none · ref 18 · internal anchor
AffectVerse improves multimodal emotion recognition by at least 2.57% on nine benchmarks through an Emotion World Module that performs short-horizon latent affective prediction via cross-modal temporal imagination and belief aggregation.
Omni-DuplexEval: Evaluating Real-time Duplex Omni-modal Interaction cs.CV · 2026-05-17 · conditional · none · ref 19 · internal anchor
Omni-DuplexEval creates a new benchmark and LLM-as-a-Judge framework for real-time duplex omni-modal interaction, revealing that current models score below 40% overall and struggle especially with proactive responses.
EvoGround: Self-Evolving Video Agents for Video Temporal Grounding cs.CV · 2026-05-13 · unverdicted · none · ref 26 · internal anchor
A proposer-solver agent pair achieves supervised-level video temporal grounding and fine-grained captioning from 2.5K unlabeled videos via self-reinforcing evolution.
OZ-TAL: Online Zero-Shot Temporal Action Localization cs.CV · 2026-05-11 · unverdicted · none · ref 50 · internal anchor
Defines OZ-TAL task and presents a training-free VLM-based method that outperforms prior approaches for online and offline zero-shot temporal action localization on THUMOS14 and ActivityNet-1.3.
IMPACT-CYCLE: A Contract-Based Multi-Agent System for Claim-Level Supervisory Correction of Long-Video Semantic Memory cs.CV · 2026-04-22 · unverdicted · none · ref 4 · internal anchor
A contract-based multi-agent system maintains a claim-level semantic memory for long videos, enabling targeted corrections that raise VQA accuracy from 0.71 to 0.79 and cut human arbitration cost by 4.8x on VidOR.
OASIS: On-Demand Hierarchical Event Memory for Streaming Video Reasoning cs.CV · 2026-04-18 · unverdicted · none · ref 22 · internal anchor
OASIS organizes streaming video into hierarchical events and retrieves memory on-demand via intent-driven refinement to improve long-horizon accuracy and compositional reasoning with bounded token costs.
SVAgent: Storyline-Guided Long Video Understanding via Cross-Modal Multi-Agent Collaboration cs.CV · 2026-04-06 · unverdicted · none · ref 14 · internal anchor
SVAgent improves long video question answering by constructing storylines via multi-agent collaboration and aligning cross-modal predictions for more robust, human-like reasoning.
VideoThinker: Building Agentic VideoLLMs with LLM-Guided Tool Reasoning cs.CV · 2026-01-22 · unverdicted · none · ref 16 · internal anchor
VideoThinker uses LLM-generated synthetic tool trajectories in caption space grounded to video frames to train agentic VideoLLMs that outperform baselines on long-video benchmarks.
Can Multi-Modal LLMs Provide Live Step-by-Step Task Guidance? cs.CV · 2025-11-27 · unverdicted · none · ref 33 · internal anchor
Introduces the first dedicated benchmark for live multi-modal LLM task guidance with mistake detection and a streaming baseline model.
EMCompress: Video-LLMs with Endomorphic Multimodal Compression cs.CV · 2025-08-27 · unverdicted · none · ref 3 · internal anchor
EMCompress introduces EMC as an endomorphic sufficient-statistic transformation for VideoQA that preserves answer invariance, releases a dedicated benchmark, and reports a ReSimplifyIt baseline with 0.40 F-1 gains plus efficiency improvements.
Video-Holmes: Can MLLM Think Like Holmes for Complex Video Reasoning? cs.CV · 2025-05-27 · conditional · none · ref 23 · internal anchor
Video-Holmes benchmark shows top MLLMs achieve at most 45% accuracy on tasks needing integration of multiple clues from suspense films, unlike existing perception-focused tests.
HumanVBench: Probing Human-Centric Video Understanding in MLLMs with Automatically Synthesized Benchmarks cs.CV · 2024-12-23 · unverdicted · none · ref 29 · internal anchor
HumanVBench provides a 16-task benchmark for human-centric video understanding in MLLMs, created through automated annotation and distractor synthesis pipelines, and shows top models lag human performance on emotion perception and cross-modal alignment.
VidHal: Benchmarking Temporal Hallucinations in Vision LLMs cs.CV · 2024-11-25 · unverdicted · none · ref 27 · internal anchor
VidHal is a new benchmark that evaluates VLLM temporal hallucinations through a caption ordering task on videos with varying hallucination levels.
LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models cs.CV · 2024-07-10 · unverdicted · none · ref 29 · internal anchor
LLaVA-NeXT-Interleave unifies multi-image, video, and 3D capabilities in large multimodal models via a new 1.18M-sample interleaved dataset and benchmark, achieving leading results across those tasks while preserving single-image performance.
MLVU: Benchmarking Multi-task Long Video Understanding cs.CV · 2024-06-06 · conditional · none · ref 25 · internal anchor
MLVU is a new benchmark for long video understanding that uses extended videos across diverse genres and multi-task evaluations, revealing that current MLLMs struggle significantly and degrade sharply with longer durations.
SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension cs.CL · 2023-07-30 · unverdicted · none · ref 15 · internal anchor
SEED-Bench is a new benchmark of 19K multiple-choice questions for evaluating generative comprehension in multimodal LLMs across 12 image and video dimensions.
Enhancing Visual Token Representations for Video Large Language Models via Training-Free Spatial-Temporal Pooling and Gridding cs.AI · 2026-05-21 · unverdicted · none · ref 5 · internal anchor
ST-GridPool improves video LLM performance via hierarchical temporal gridding and norm-based spatial pooling on visual tokens without training.
DynaTok: Temporally Adaptive and Positional Bias-Aware Token Compression for Video-LLMs cs.CV · 2026-05-19 · unverdicted · none · ref 11 · internal anchor
DynaTok introduces temporally adaptive budget allocation with EMA memory and spatial selection with memory to compress video tokens, retaining over 95% accuracy at 90% reduction on VideoQA benchmarks.
See What I Mean: Aligning Vision and Language Representations for Video Fine-grained Object Understanding cs.CV · 2026-05-18 · unverdicted · none · ref 34 · internal anchor
SWIM aligns cross-attention maps from object nouns to ground-truth masks during training on the new NL-Refer dataset to enable text-only fine-grained video object understanding in MLLMs.
OProver: A Unified Framework for Agentic Formal Theorem Proving cs.CL · 2026-05-17 · unverdicted · none · ref 75 · internal anchor
OProver-32B achieves top Pass@32 scores on MiniF2F, ProverBench, and PutnamBench by combining continued pretraining with iterative agentic proving, retrieval, SFT on repairs, and RL on unresolved cases using a 6.86M-proof dataset.
Beyond Perceptual Shortcuts: Causal-Inspired Debiasing Optimization for Generalizable Video Reasoning in Lightweight MLLMs cs.CV · 2026-05-02 · unverdicted · none · ref 36 · internal anchor
VideoThinker improves lightweight MLLM video reasoning by creating a bias model to capture shortcuts and applying causal debiasing policy optimization to push away from them, achieving SOTA efficiency with minimal data.
Seeing Fast and Slow: Learning the Flow of Time in Videos cs.CV · 2026-04-23 · unverdicted · none · ref 32 · internal anchor
Self-supervised models learn to perceive and manipulate the flow of time in videos, supporting speed detection, large-scale slow-motion data curation, and temporally controllable video synthesis.
One Token per Highly Selective Frame: Towards Extreme Compression for Long Video Understanding cs.CV · 2026-04-15 · unverdicted · none · ref 34 · internal anchor
XComp reaches extreme video compression (one token per selective frame) via learnable progressive token compression and question-conditioned frame selection, lifting LVBench accuracy from 42.9 percent to 46.2 percent after tuning on 2.5 percent of standard data.
ViLL-E: Video LLM Embeddings for Retrieval cs.CV · 2026-04-13 · unverdicted · none · ref 25 · internal anchor
ViLL-E introduces a dynamic embedding mechanism and joint contrastive-generative training for VideoLLMs, delivering up to 7% gains in temporal localization and 4% in video retrieval while enabling new zero-shot capabilities.
POINTS-Long: Adaptive Dual-Mode Visual Reasoning in MLLMs cs.CV · 2026-04-13 · unverdicted · none · ref 40 · internal anchor
POINTS-Long is a dual-mode multimodal large language model that uses dynamic visual token scaling to retain 97.7-99.7% accuracy on long-form tasks with 1/40 to 1/10th the tokens and supports streaming via detachable KV-cache.
Chat-Scene++: Exploiting Context-Rich Object Identification for 3D LLM cs.CV · 2026-03-29 · unverdicted · none · ref 7 · internal anchor
Chat-Scene++ improves 3D scene understanding in multimodal LLMs by representing scenes as context-rich object sequences with identifier tokens and grounded chain-of-thought reasoning, reaching state-of-the-art on five benchmarks using pre-trained encoders.
ForestPrune: High-ratio Visual Token Compression for Video Multimodal Large Language Models via Spatial-Temporal Forest Modeling cs.CV · 2026-03-24 · unverdicted · none · ref 25 · internal anchor
ForestPrune prunes 90% of visual tokens in video MLLMs like LLaVA-OneVision while retaining 95.8% accuracy by modeling tokens as spatial-temporal forests and scoring importance via tree depth and node roles.
Towards Effective Long Video Understanding of Multimodal Large Language Models via One-shot Clip Retrieval cs.CV · 2025-12-09 · unverdicted · none · ref 21 · internal anchor
OneClip-RAG enables MLLMs to handle long videos via one-shot clip retrieval and unified chunking-retrieval, delivering performance gains like matching GPT-5 level on MLVU with high efficiency on standard GPUs.
Detector-Empowered Video Large Language Model for Efficient Spatio-Temporal Grounding cs.CV · 2025-12-07 · conditional · none · ref 37 · internal anchor
DEViL offloads spatial grounding to a detector via a distilled reference-semantic token and temporal consistency regularization, reaching 43.1% m_vIoU at 14.33 FPS on HC-STVG.
OmniZip: Audio-Guided Dynamic Token Compression for Fast Omnimodal Large Language Models cs.CV · 2025-11-18 · conditional · none · ref 22 · internal anchor
OmniZip introduces an audio-guided dynamic token compression framework that achieves 3.42X inference speedup and 1.4X memory reduction for omnimodal LLMs without any training.
Cambrian-S: Towards Spatial Supersensing in Video cs.CV · 2025-11-06 · unverdicted · none · ref 69 · internal anchor
Cambrian-S introduces VSI-SUPER benchmarks for long-horizon spatial recall and counting, shows data scaling yields 30% gains on existing tests, and demonstrates a self-supervised next-latent predictor using surprise outperforms baselines on the new spatial supersensing tasks.
Perceive, Verify and Understand Long Video: Multi-Granular Perception and Active Verification via Interactive Agents cs.CV · 2025-09-29 · unverdicted · none · ref 18 · internal anchor
CogniGPT uses an interactive loop between a Multi-Granular Perception Agent and an Active Verification Agent to identify reliable clues in long videos with high accuracy and low frame usage.
InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency cs.CV · 2025-08-25 · unverdicted · none · ref 62 · internal anchor
InternVL3.5 advances open-source multimodal models with Cascade RL for +16% reasoning gains and ViR for 4x inference speedup, with the 241B model reaching SOTA among open-source MLLMs on multimodal, reasoning, and agentic tasks.
GoViG: Goal-Conditioned Visual Navigation Instruction Generation via Multimodal Reasoning cs.CV · 2025-08-13 · unverdicted · none · ref 16 · internal anchor
GoViG decomposes goal-conditioned navigation instruction generation into visual state prediction and instruction synthesis using an autoregressive multimodal LLM with one-pass and interleaved reasoning, showing gains on a new R2R-Goal dataset.
B4DL: A Benchmark for 4D LiDAR LLM in Spatio-Temporal Understanding cs.CV · 2025-08-07 · unverdicted · none · ref 15 · internal anchor
B4DL provides a new benchmark, scalable data generation pipeline, and MLLM architecture for direct spatio-temporal reasoning on raw 4D LiDAR data.
Training-Free Multimodal Large Language Model Orchestration cs.CL · 2025-08-06 · unverdicted · none · ref 26 · 2 links · internal anchor
LLM Orchestration integrates modality experts via an LLM controller, cross-modal memory, and interaction layer to enable multimodal input-output without gradient-based training.
UniMind: Unleashing the Power of LLMs for Unified Multi-Task Brain Decoding cs.HC · 2025-06-23 · unverdicted · none · ref 44 · internal anchor
UniMind unifies multi-task brain decoding from EEG by bridging signals to LLMs via a Neuro-Language Connector and dynamic task queries, outperforming prior models by 12% on average across ten datasets.
Spatial-MLLM: Boosting MLLM Capabilities in Visual-based Spatial Intelligence cs.CV · 2025-05-29 · unverdicted · none · ref 43 · 2 links · internal anchor
Spatial-MLLM adds a 3D spatial encoder initialized from a visual geometry model and space-aware frame sampling to MLLMs to improve spatial understanding and reasoning from purely 2D visual inputs.
InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models cs.CV · 2025-04-14 · conditional · none · ref 64 · internal anchor
InternVL3-78B sets a new open-source SOTA of 72.2 on MMMU via native joint multimodal pre-training, V2PE, MPO, and test-time scaling while remaining competitive with proprietary models.
Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos cs.CV · 2025-01-07 · conditional · none · ref 52 · internal anchor
Sa2VA unifies SAM-2 segmentation with MLLM reasoning into a single model for referring segmentation and conversation on images and videos, supported by a new 72k-expression Ref-SAV dataset.
MotionBench: Benchmarking and Improving Fine-grained Video Motion Understanding for Vision Language Models cs.CV · 2025-01-06 · unverdicted · none · ref 22 · internal anchor
MotionBench is a new benchmark showing poor fine-grained motion understanding in VLMs and proposes TE Fusion to improve performance with higher frame rates.
VideoChat-Flash: Hierarchical Compression for Long-Context Video Modeling cs.CV · 2024-12-31 · unverdicted · none · ref 27 · internal anchor
VideoChat-Flash applies hierarchical video token compression to achieve ~50x reduction in context length for long videos while maintaining near-original performance on long-context benchmarks.
Uni-NaVid: A Video-based Vision-Language-Action Model for Unifying Embodied Navigation Tasks cs.RO · 2024-12-09 · unverdicted · none · ref 49 · internal anchor
Uni-NaVid unifies diverse embodied navigation tasks into one video-based vision-language-action model trained on 3.6 million samples from four sub-tasks, achieving state-of-the-art performance on benchmarks and real-world tests.
Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling cs.CV · 2024-12-06 · unverdicted · none · ref 130 · internal anchor
InternVL 2.5 is the first open-source MLLM to surpass 70% on the MMMU benchmark via model, data, and test-time scaling, with a 3.7-point gain from chain-of-thought reasoning.
LLaVA-Video: Video Instruction Tuning With Synthetic Data cs.CV · 2024-10-03 · unverdicted · none · ref 30 · internal anchor
LLaVA-Video-178K is a new synthetic video instruction dataset that, when combined with existing data to train LLaVA-Video, produces strong results on video understanding benchmarks.
TempCompass: Do Video LLMs Really Understand Videos? cs.CV · 2024-03-01 · unverdicted · none · ref 96 · internal anchor
TempCompass benchmark reveals that state-of-the-art Video LLMs have poor ability to perceive temporal aspects such as speed, direction, and ordering in videos.
MVBench: A Comprehensive Multi-modal Video Understanding Benchmark cs.CV · 2023-11-28 · accept · none · ref 42 · internal anchor
MVBench is a benchmark of 20 temporal video understanding tasks built by transforming static tasks into dynamic ones, with VideoChat2 outperforming prior MLLMs by over 15%.
Video-LLaVA: Learning United Visual Representation by Alignment Before Projection cs.CV · 2023-11-16 · unverdicted · none · ref 63 · internal anchor
Video-LLaVA creates a unified visual representation for images and videos via pre-projection alignment, enabling mutual enhancement from joint training and strong results on image and video benchmarks.
InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation cs.CV · 2023-07-13 · unverdicted · none · ref 10 · internal anchor
InternVid supplies 7M videos and LLM captions to train ViCLIP, which reaches leading zero-shot action recognition and competitive retrieval performance.

VideoChat: Chat-Centric Video Understanding

hub tools

citation-role summary

citation-polarity summary

claims ledger

co-cited works

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer