hub Mixed citations

VideoChat: Chat-Centric Video Understanding

KunChang Li, Yinan He, Yi Wang, Yizhuo Li, Wenhai Wang, Ping Luo · 2023 · cs.CV · arXiv 2305.06355

Mixed citation behavior. Most common role is background (64%).

90 Pith papers citing it

Background 64% of classified citations


abstract

In this paper, we initiate an attempt of developing an end-to-end chat-centric video understanding system, coined as VideoChat. It integrates video foundation models and large language models via a learnable neural interface, excelling in spatiotemporal reasoning, event localization, and causal relationship inference. To instructively tune this system, we build a video-centric instruction dataset, composed of thousands of videos associated with detailed descriptions and conversations. This dataset emphasizes spatiotemporal reasoning and captures causal relationships, providing a valuable asset for training our chat-centric video understanding system. Preliminary qualitative experiments demonstrate the potential of our system across a broad spectrum of video applications, which could serve as a simple prototype system for future research on chat-centric video understanding. Access our code and data at https://github.com/OpenGVLab/Ask-Anything


citation-role summary

background 16 · baseline 7 · dataset 1 · method 1
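
The 64% background share quoted at the top of the page follows from these counts. A minimal sketch of that arithmetic, assuming "classified citations" means the sum of the four role counts listed here (the function name `role_shares` is illustrative and not part of any Pith tool):

```python
# Role counts copied from the citation-role summary above.
role_counts = {"background": 16, "baseline": 7, "dataset": 1, "method": 1}


def role_shares(counts):
    """Return each role's share of classified citations as a rounded percentage."""
    total = sum(counts.values())
    return {role: round(100 * n / total) for role, n in counts.items()}


total_classified = sum(role_counts.values())
shares = role_shares(role_counts)

print(total_classified)      # 25 classified citations
print(shares["background"])  # 64
```

Under that assumption, only 25 citations carry a role label, so the 64% figure describes the classified subset and not all 90 citing papers.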

citation-polarity summary

background 16 · baseline 7 · use dataset 1 · use method 1

claims ledger

baseline 2 Related Work Multimodal Large Language Models. With the impressive success of Large language models (LLM) [1, 5, 4], recent studies work on generative Multimodal Large Language Models (MLLMs) [6, 7, 8, 9, 10, 11, 12, 13, 14, 18, 19, 20, 21] to improve multimodal comprehension and generation through utilizing the strong generality of LLMs. Some work [15, 16, 17] further considers video inputs and leverage the vast capabilities of LLMs for video understanding tasks. In SEED-Bench, we provide a
baseline Video → Text Text → Video method #F R@1 R@5 R@10 R@1 R@5 R@10 avg. OpenAI CLIP-L [117] 1 27.8 49.4 58.0 29.0 50.5 59.2 45.7 InternVL-C (ours) 1 35.3 56.6 66.6 37.5 60.9 70.9 54.6 InternVL-G (ours) 1 36.6 58.3 67.7 39.1 61.7 70.7 55.7 OpenAI CLIP-L [117] 8 26.6 50.8 61.8 30.7 54.4 64.0 48.1 Florence [171] 8 - - - 37.6 63.8 72.6 - InternVideo† [151] 8 39.6 - - 40.7 - - - UMT-L† [83] 8 38.6 59.8 69.6 42.6 64.4 73.1 58.0 LanguageBind† [186] 8 40.9 66.4 75.7 44.8 70.0 78.7 62.8 InternVL-C (ours) 8 40.
background MLLMs-based video translation. By structuring our survey in this role-oriented manner (see Fig. 1), we aim to provide conceptual clarity and facilitate comparative analysis. The arXiv:2604.11283v1 [cs.CV] 13 Apr 2026 2 Taxonomy The SemanticReasoner Video-Language Alignment MiniGPT4-Video [1], FrozenBiLM [2], Video-ChatGPT [3], Video-LLaMA [4], VideoChat [5], LLaMA-VID [6], Valley [7], Vista-LLaMA [8], IG-VLM [9], VideoChat2 [10], VaQuitA [11], Vamos [12], COSMO [13], IVA [14], MMICT [15], LXMERT
baseline 4 / 55.0 61.0 1.39 56.6 53.0 54.1 Qwen2-VL-2B [138] 55.6 / 60.4 63.2 - - - - Qwen2.5-VL-3B [5] 61.5 / 67.6 67.0 1.63 68.2 43.3 60.3 InternVL3-2B [187] 58.9 / 61.4 70.4 1.42 64.2 55.4 59.6 InternVL3.5-2B 58.4 / 61.9 65.9 1.56 64.4 57.4 60.0 MiniCPM-V-4-4B [164] 61.2 / 65.8 58.7 - - - - InternVL3.5-4B 65.4 / 68.6 71.2 1.59 70.4 60.8 64.9 VideoChat2-HD [62] 45.3 / 55.7 62.3 1.22 47.9 - - LLaVA-OneVision-7B [58] 58.2 / - 56.7 - - - - MiniCPM-V-2.6 [164] 60.9 / 63.6 - 1.70 - 54.9 - Qwen2-VL-7B [138]
dataset Img-Diff (en) [101], Birds-to-Words (en) [100], Spot-the-Diff (en) [100], MultiVQA (en) [100], NLVR2 (en) [216], General QA ContrastiveCaption (en) [100], DreamSim (en) [100], InternVL-SA-1B-Caption (en & zh) [36] Document MP-DocVQA (en) [233], MP-Docmatix (en) [121] Type: Video Datasets Vript (en & zh) [269], OpenVid (en) [190], Mementos (en) [254], ShareGPT4o-Video (en & zh) [35], Captioning ShareGPT4Video (en & zh) [30], VideoGPT+ (en) [174] VideoChat2-IT (en & zh) [130, 131], EgoTaskQA (en) [9
background improved long video understanding accuracy. 2 Related Work Vision-Language Models for Long Sequence Understanding. Early Vision-Language Models (VLMs), such as GPT-4V and Gemini-1.5 [49, 58], showcased powerful multimodal reasoning by integrating visual encoders with large language models. Open-source efforts like Llama-Vid [36], IDEFICS [24], VideoChat [34], Video-LLaMA [12], and others [2, 32, 35, 38, 44, 61, 62] have further advanced capabilities, often matching or exceeding proprietary system

representative citing papers

No Place to Hide: Benchmarking Video Hallucination with Background-Controlled Pairs

cs.CV · 2026-06-30 · unverdicted · novelty 7.0

Introduces VidPair-Halluc benchmark of 1K background-controlled adversarial video pairs and 11K QA pairs generated via PairFlow pipeline to evaluate hallucination in LVMs.

NEST: Narrative Event Structures in Time for Long Video Understanding

cs.CV · 2026-06-18 · unverdicted · novelty 7.0

NEST is a new benchmark dataset for narrative event structures in long videos, with baselines reporting ETD below 8%, EL under 6%, EAE below 11%, and ERE at 35-44% F1.

Beyond the Current Observation: Evaluating Multimodal Large Language Models in Controllable Non-Markov Games

cs.CV · 2026-06-17 · unverdicted · novelty 7.0

RNG-Bench evaluates MLLMs on hidden-observation reconstruction in non-Markov games, finds forgetting as the dominant error source, and shows fine-tuning on optimal rollouts improves performance with transfer to other benchmarks.

Imagine Before You Predict: Interleaved Latent Visual Reasoning for Video Event Prediction

cs.CV · 2026-06-04 · unverdicted · novelty 7.0

Future-L1 interleaves latent visual spans with text in MLLM decoding, trained on a custom Future-L1-50K dataset via LA-DAPO RL, and reports SOTA gains on FutureBench (61.0 to 85.4) and TwiFF-Bench (2.44 to 3.04).

Balancing Image Compression and Generation with Bootstrapped Tokenization

cs.LG · 2026-06-04 · unverdicted · novelty 7.0

SelfBootTok decomposes image tokens into global and local groups via self-bootstrapped learning, enabling generators to use only global tokens for ~40% less computation and a new SOTA gFID of 1.56 with 64 tokens.

EvoCut: Multi-Layer Evolution-Aware Visual Token Compression for Efficient Large Vision-Language Models

cs.CV · 2026-06-01 · conditional · novelty 7.0

EvoCut is a training-free visual token compression technique that identifies important tokens via multi-layer evolution deviation, retaining 11.1% tokens with 94.4% average performance preserved on LLaVA-1.5-7B.

CaST-Bench: Benchmarking Causal Chain-Grounded Spatio-Temporal Reasoning for Video Question Answering

cs.CV · 2026-05-22 · unverdicted · novelty 7.0 · 2 refs

CaST-Bench creates a benchmark with causal-chain annotations and novel metrics showing that current VLMs struggle to construct precise grounded causal chains in video QA.

AffectVerse: Emotional World Models for Multimodal Affective Computing

cs.CV · 2026-05-19 · unverdicted · novelty 7.0

AffectVerse improves multimodal emotion recognition by at least 2.57% on nine benchmarks through an Emotion World Module that performs short-horizon latent affective prediction via cross-modal temporal imagination and belief aggregation.

EvoGround: Self-Evolving Video Agents for Video Temporal Grounding

cs.CV · 2026-05-13 · unverdicted · novelty 7.0

A proposer-solver agent pair achieves supervised-level video temporal grounding and fine-grained captioning from 2.5K unlabeled videos via self-reinforcing evolution.

OZ-TAL: Online Zero-Shot Temporal Action Localization

cs.CV · 2026-05-11 · unverdicted · novelty 7.0

Defines OZ-TAL task and presents a training-free VLM-based method that outperforms prior approaches for online and offline zero-shot temporal action localization on THUMOS14 and ActivityNet-1.3.

IMPACT-CYCLE: A Contract-Based Multi-Agent System for Claim-Level Supervisory Correction of Long-Video Semantic Memory

cs.CV · 2026-04-22 · unverdicted · novelty 7.0

A contract-based multi-agent system maintains a claim-level semantic memory for long videos, enabling targeted corrections that raise VQA accuracy from 0.71 to 0.79 and cut human arbitration cost by 4.8x on VidOR.

OASIS: On-Demand Hierarchical Event Memory for Streaming Video Reasoning

cs.CV · 2026-04-18 · unverdicted · novelty 7.0

OASIS organizes streaming video into hierarchical events and retrieves memory on-demand via intent-driven refinement to improve long-horizon accuracy and compositional reasoning with bounded token costs.

SVAgent: Storyline-Guided Long Video Understanding via Cross-Modal Multi-Agent Collaboration

cs.CV · 2026-04-06 · unverdicted · novelty 7.0

SVAgent improves long video question answering by constructing storylines via multi-agent collaboration and aligning cross-modal predictions for more robust, human-like reasoning.

VideoThinker: Building Agentic VideoLLMs with LLM-Guided Tool Reasoning

cs.CV · 2026-01-22 · unverdicted · novelty 7.0

VideoThinker uses LLM-generated synthetic tool trajectories in caption space grounded to video frames to train agentic VideoLLMs that outperform baselines on long-video benchmarks.

Can Multi-Modal LLMs Provide Live Step-by-Step Task Guidance?

cs.CV · 2025-11-27 · unverdicted · novelty 7.0

Introduces the first dedicated benchmark for live multi-modal LLM task guidance with mistake detection and a streaming baseline model.

EMCompress: Video-LLMs with Endomorphic Multimodal Compression

cs.CV · 2025-08-27 · unverdicted · novelty 7.0

EMCompress introduces EMC as an endomorphic sufficient-statistic transformation for VideoQA that preserves answer invariance, releases a dedicated benchmark, and reports a ReSimplifyIt baseline with 0.40 F-1 gains plus efficiency improvements.

Video-Holmes: Can MLLM Think Like Holmes for Complex Video Reasoning?

cs.CV · 2025-05-27 · conditional · novelty 7.0

Video-Holmes benchmark shows top MLLMs achieve at most 45% accuracy on tasks needing integration of multiple clues from suspense films, unlike existing perception-focused tests.

HumanVBench: Probing Human-Centric Video Understanding in MLLMs with Automatically Synthesized Benchmarks

cs.CV · 2024-12-23 · unverdicted · novelty 7.0

HumanVBench provides a 16-task benchmark for human-centric video understanding in MLLMs, created through automated annotation and distractor synthesis pipelines, and shows top models lag human performance on emotion perception and cross-modal alignment.

VidHal: Benchmarking Temporal Hallucinations in Vision LLMs

cs.CV · 2024-11-25 · unverdicted · novelty 7.0

VidHal is a new benchmark that evaluates VLLM temporal hallucinations through a caption ordering task on videos with varying hallucination levels.

LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models

cs.CV · 2024-07-10 · unverdicted · novelty 7.0

LLaVA-NeXT-Interleave unifies multi-image, video, and 3D capabilities in large multimodal models via a new 1.18M-sample interleaved dataset and benchmark, achieving leading results across those tasks while preserving single-image performance.

MLVU: Benchmarking Multi-task Long Video Understanding

cs.CV · 2024-06-06 · conditional · novelty 7.0

MLVU is a new benchmark for long video understanding that uses extended videos across diverse genres and multi-task evaluations, revealing that current MLLMs struggle significantly and degrade sharply with longer durations.

SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension

cs.CL · 2023-07-30 · unverdicted · novelty 7.0

SEED-Bench is a new benchmark of 19K multiple-choice questions for evaluating generative comprehension in multimodal LLMs across 12 image and video dimensions.

VisReflect: Latent Visual Reflection for Fine-Grained Perception in Long Visual Context

cs.CV · 2026-06-29 · unverdicted · novelty 6.0

VisReflect generates continuous latent visual reflections to emphasize relevant visual features and guide attention in LVLMs, yielding 4.1% gains on image benchmarks and 1.8% on video benchmarks with 44% less inference time than zooming methods.

HPP: Hierarchical Programmatic Probing for Long Video Understanding by Decoupling Perception and Reasoning

cs.CV · 2026-06-19 · unverdicted · novelty 6.0

HPP decouples perception from reasoning in long-video VLMs by having an LLM run iterative programmatic probes on hierarchically segmented video, reporting gains on LongVideoBench, EgoSchema, VideoMME, and MLVU.

citing papers explorer

Showing 50 of 90 citing papers.

No Place to Hide: Benchmarking Video Hallucination with Background-Controlled Pairs cs.CV · 2026-06-30 · unverdicted · none · ref 38 · internal anchor
Introduces VidPair-Halluc benchmark of 1K background-controlled adversarial video pairs and 11K QA pairs generated via PairFlow pipeline to evaluate hallucination in LVMs.
NEST: Narrative Event Structures in Time for Long Video Understanding cs.CV · 2026-06-18 · unverdicted · none · ref 282 · internal anchor
NEST is a new benchmark dataset for narrative event structures in long videos, with baselines reporting ETD below 8%, EL under 6%, EAE below 11%, and ERE at 35-44% F1.
Beyond the Current Observation: Evaluating Multimodal Large Language Models in Controllable Non-Markov Games cs.CV · 2026-06-17 · unverdicted · none · ref 37 · internal anchor
RNG-Bench evaluates MLLMs on hidden-observation reconstruction in non-Markov games, finds forgetting as the dominant error source, and shows fine-tuning on optimal rollouts improves performance with transfer to other benchmarks.
Imagine Before You Predict: Interleaved Latent Visual Reasoning for Video Event Prediction cs.CV · 2026-06-04 · unverdicted · none · ref 4 · internal anchor
Future-L1 interleaves latent visual spans with text in MLLM decoding, trained on a custom Future-L1-50K dataset via LA-DAPO RL, and reports SOTA gains on FutureBench (61.0 to 85.4) and TwiFF-Bench (2.44 to 3.04).
Balancing Image Compression and Generation with Bootstrapped Tokenization cs.LG · 2026-06-04 · unverdicted · none · ref 40 · internal anchor
SelfBootTok decomposes image tokens into global and local groups via self-bootstrapped learning, enabling generators to use only global tokens for ~40% less computation and a new SOTA gFID of 1.56 with 64 tokens.
EvoCut: Multi-Layer Evolution-Aware Visual Token Compression for Efficient Large Vision-Language Models cs.CV · 2026-06-01 · conditional · none · ref 23 · internal anchor
EvoCut is a training-free visual token compression technique that identifies important tokens via multi-layer evolution deviation, retaining 11.1% tokens with 94.4% average performance preserved on LLaVA-1.5-7B.
CaST-Bench: Benchmarking Causal Chain-Grounded Spatio-Temporal Reasoning for Video Question Answering cs.CV · 2026-05-22 · unverdicted · none · ref 21 · 2 links · internal anchor
CaST-Bench creates a benchmark with causal-chain annotations and novel metrics showing that current VLMs struggle to construct precise grounded causal chains in video QA.
AffectVerse: Emotional World Models for Multimodal Affective Computing cs.CV · 2026-05-19 · unverdicted · none · ref 18 · internal anchor
AffectVerse improves multimodal emotion recognition by at least 2.57% on nine benchmarks through an Emotion World Module that performs short-horizon latent affective prediction via cross-modal temporal imagination and belief aggregation.
EvoGround: Self-Evolving Video Agents for Video Temporal Grounding cs.CV · 2026-05-13 · unverdicted · none · ref 26 · internal anchor
A proposer-solver agent pair achieves supervised-level video temporal grounding and fine-grained captioning from 2.5K unlabeled videos via self-reinforcing evolution.
OZ-TAL: Online Zero-Shot Temporal Action Localization cs.CV · 2026-05-11 · unverdicted · none · ref 50 · internal anchor
Defines OZ-TAL task and presents a training-free VLM-based method that outperforms prior approaches for online and offline zero-shot temporal action localization on THUMOS14 and ActivityNet-1.3.
IMPACT-CYCLE: A Contract-Based Multi-Agent System for Claim-Level Supervisory Correction of Long-Video Semantic Memory cs.CV · 2026-04-22 · unverdicted · none · ref 4 · internal anchor
A contract-based multi-agent system maintains a claim-level semantic memory for long videos, enabling targeted corrections that raise VQA accuracy from 0.71 to 0.79 and cut human arbitration cost by 4.8x on VidOR.
OASIS: On-Demand Hierarchical Event Memory for Streaming Video Reasoning cs.CV · 2026-04-18 · unverdicted · none · ref 22 · internal anchor
OASIS organizes streaming video into hierarchical events and retrieves memory on-demand via intent-driven refinement to improve long-horizon accuracy and compositional reasoning with bounded token costs.
SVAgent: Storyline-Guided Long Video Understanding via Cross-Modal Multi-Agent Collaboration cs.CV · 2026-04-06 · unverdicted · none · ref 14 · internal anchor
SVAgent improves long video question answering by constructing storylines via multi-agent collaboration and aligning cross-modal predictions for more robust, human-like reasoning.
VideoThinker: Building Agentic VideoLLMs with LLM-Guided Tool Reasoning cs.CV · 2026-01-22 · unverdicted · none · ref 16 · internal anchor
VideoThinker uses LLM-generated synthetic tool trajectories in caption space grounded to video frames to train agentic VideoLLMs that outperform baselines on long-video benchmarks.
Can Multi-Modal LLMs Provide Live Step-by-Step Task Guidance? cs.CV · 2025-11-27 · unverdicted · none · ref 33 · internal anchor
Introduces the first dedicated benchmark for live multi-modal LLM task guidance with mistake detection and a streaming baseline model.
EMCompress: Video-LLMs with Endomorphic Multimodal Compression cs.CV · 2025-08-27 · unverdicted · none · ref 3 · internal anchor
EMCompress introduces EMC as an endomorphic sufficient-statistic transformation for VideoQA that preserves answer invariance, releases a dedicated benchmark, and reports a ReSimplifyIt baseline with 0.40 F-1 gains plus efficiency improvements.
Video-Holmes: Can MLLM Think Like Holmes for Complex Video Reasoning? cs.CV · 2025-05-27 · conditional · none · ref 23 · internal anchor
Video-Holmes benchmark shows top MLLMs achieve at most 45% accuracy on tasks needing integration of multiple clues from suspense films, unlike existing perception-focused tests.
HumanVBench: Probing Human-Centric Video Understanding in MLLMs with Automatically Synthesized Benchmarks cs.CV · 2024-12-23 · unverdicted · none · ref 29 · internal anchor
HumanVBench provides a 16-task benchmark for human-centric video understanding in MLLMs, created through automated annotation and distractor synthesis pipelines, and shows top models lag human performance on emotion perception and cross-modal alignment.
VidHal: Benchmarking Temporal Hallucinations in Vision LLMs cs.CV · 2024-11-25 · unverdicted · none · ref 27 · internal anchor
VidHal is a new benchmark that evaluates VLLM temporal hallucinations through a caption ordering task on videos with varying hallucination levels.
LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models cs.CV · 2024-07-10 · unverdicted · none · ref 29 · internal anchor
LLaVA-NeXT-Interleave unifies multi-image, video, and 3D capabilities in large multimodal models via a new 1.18M-sample interleaved dataset and benchmark, achieving leading results across those tasks while preserving single-image performance.
MLVU: Benchmarking Multi-task Long Video Understanding cs.CV · 2024-06-06 · conditional · none · ref 25 · internal anchor
MLVU is a new benchmark for long video understanding that uses extended videos across diverse genres and multi-task evaluations, revealing that current MLLMs struggle significantly and degrade sharply with longer durations.
SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension cs.CL · 2023-07-30 · unverdicted · none · ref 15 · internal anchor
SEED-Bench is a new benchmark of 19K multiple-choice questions for evaluating generative comprehension in multimodal LLMs across 12 image and video dimensions.
VisReflect: Latent Visual Reflection for Fine-Grained Perception in Long Visual Context cs.CV · 2026-06-29 · unverdicted · none · ref 19 · internal anchor
VisReflect generates continuous latent visual reflections to emphasize relevant visual features and guide attention in LVLMs, yielding 4.1% gains on image benchmarks and 1.8% on video benchmarks with 44% less inference time than zooming methods.
HPP: Hierarchical Programmatic Probing for Long Video Understanding by Decoupling Perception and Reasoning cs.CV · 2026-06-19 · unverdicted · none · ref 154 · internal anchor
HPP decouples perception from reasoning in long-video VLMs by having an LLM run iterative programmatic probes on hierarchically segmented video, reporting gains on LongVideoBench, EgoSchema, VideoMME, and MLVU.
Don't Pause: Streaming Video-Language Synchrony for Online Video Understanding cs.CV · 2026-06-05 · unverdicted · none · ref 3 · internal anchor
LyraV uses FDTC and SToP for per-frame incremental decoding to reach 98.29% video synchrony at 3.89 FPS while preserving general understanding.
AdaCodec: A Predictive Visual Code for Video MLLMs cs.CV · 2026-06-01 · unverdicted · none · ref 1 · internal anchor
AdaCodec introduces a predictive visual code that cuts visual token use in video MLLMs by sending full frames only on high predictive cost and otherwise encoding inter-frame changes as P-tokens, yielding better benchmark scores at lower budgets.
V-LynX: Token Interface Alignment for Video+X LLMs cs.CV · 2026-05-30 · unverdicted · none · ref 4 · internal anchor
V-LynX integrates novel modalities into frozen Video LLMs by aligning to an internalized continuous token manifold using unpaired unimodal data and attention/statistical matching.
Enhancing Visual Token Representations for Video Large Language Models via Training-Free Spatial-Temporal Pooling and Gridding cs.AI · 2026-05-21 · unverdicted · none · ref 5 · internal anchor
ST-GridPool improves video LLM performance via hierarchical temporal gridding and norm-based spatial pooling on visual tokens without training.
DynaTok: Temporally Adaptive and Positional Bias-Aware Token Compression for Video-LLMs cs.CV · 2026-05-19 · unverdicted · none · ref 11 · internal anchor
DynaTok introduces temporally adaptive budget allocation with EMA memory and spatial selection with memory to compress video tokens, retaining over 95% accuracy at 90% reduction on VideoQA benchmarks.
See What I Mean: Aligning Vision and Language Representations for Video Fine-grained Object Understanding cs.CV · 2026-05-18 · unverdicted · none · ref 34 · internal anchor
SWIM aligns cross-attention maps from object nouns to ground-truth masks during training on the new NL-Refer dataset to enable text-only fine-grained video object understanding in MLLMs.
Omni-DuplexEval: Evaluating Real-time Duplex Omni-modal Interaction cs.CV · 2026-05-17 · unverdicted · none · ref 19 · 2 links · internal anchor
Omni-DuplexEval provides a new benchmark and automatic evaluation method for real-time duplex omni-modal interaction, showing state-of-the-art models reach only 39.6% overall and 20% on proactive reminders.
OProver: A Unified Framework for Agentic Formal Theorem Proving cs.CL · 2026-05-17 · unverdicted · none · ref 75 · internal anchor
OProver-32B achieves top Pass@32 scores on MiniF2F, ProverBench, and PutnamBench by combining continued pretraining with iterative agentic proving, retrieval, SFT on repairs, and RL on unresolved cases using a 6.86M-proof dataset.
TRACE: Evidence Grounding-Guided Multi-Video Event Understanding and Claim Generation cs.CV · 2026-05-16 · unverdicted · none · ref 4 · 2 links · internal anchor
TRACE improves multi-video event understanding by grounding evidence in structured timelines before visual reasoning, raising MiRAGE F1 from 0.705 to 0.811 on MAGMaR 2026.
Beyond Perceptual Shortcuts: Causal-Inspired Debiasing Optimization for Generalizable Video Reasoning in Lightweight MLLMs cs.CV · 2026-05-02 · unverdicted · none · ref 36 · internal anchor
VideoThinker improves lightweight MLLM video reasoning by creating a bias model to capture shortcuts and applying causal debiasing policy optimization to push away from them, achieving SOTA efficiency with minimal data.
Seeing Fast and Slow: Learning the Flow of Time in Videos cs.CV · 2026-04-23 · unverdicted · none · ref 32 · internal anchor
Self-supervised models learn to perceive and manipulate the flow of time in videos, supporting speed detection, large-scale slow-motion data curation, and temporally controllable video synthesis.
One Token per Highly Selective Frame: Towards Extreme Compression for Long Video Understanding cs.CV · 2026-04-15 · unverdicted · none · ref 34 · internal anchor
XComp reaches extreme video compression (one token per selective frame) via learnable progressive token compression and question-conditioned frame selection, lifting LVBench accuracy from 42.9 percent to 46.2 percent after tuning on 2.5 percent of standard data.
ViLL-E: Video LLM Embeddings for Retrieval cs.CV · 2026-04-13 · unverdicted · none · ref 25 · internal anchor
ViLL-E introduces a dynamic embedding mechanism and joint contrastive-generative training for VideoLLMs, delivering up to 7% gains in temporal localization and 4% in video retrieval while enabling new zero-shot capabilities.
POINTS-Long: Adaptive Dual-Mode Visual Reasoning in MLLMs cs.CV · 2026-04-13 · unverdicted · none · ref 40 · internal anchor
POINTS-Long is a dual-mode multimodal large language model that uses dynamic visual token scaling to retain 97.7-99.7% accuracy on long-form tasks with 1/40 to 1/10th the tokens and supports streaming via detachable KV-cache.
Chat-Scene++: Exploiting Context-Rich Object Identification for 3D LLM cs.CV · 2026-03-29 · unverdicted · none · ref 7 · internal anchor
Chat-Scene++ improves 3D scene understanding in multimodal LLMs by representing scenes as context-rich object sequences with identifier tokens and grounded chain-of-thought reasoning, reaching state-of-the-art on five benchmarks using pre-trained encoders.
ForestPrune: High-ratio Visual Token Compression for Video Multimodal Large Language Models via Spatial-Temporal Forest Modeling cs.CV · 2026-03-24 · unverdicted · none · ref 25 · internal anchor
ForestPrune prunes 90% of visual tokens in video MLLMs like LLaVA-OneVision while retaining 95.8% accuracy by modeling tokens as spatial-temporal forests and scoring importance via tree depth and node roles.
Towards Effective Long Video Understanding of Multimodal Large Language Models via One-shot Clip Retrieval cs.CV · 2025-12-09 · unverdicted · none · ref 21 · internal anchor
OneClip-RAG enables MLLMs to handle long videos via one-shot clip retrieval and unified chunking-retrieval, delivering performance gains like matching GPT-5 level on MLVU with high efficiency on standard GPUs.
Detector-Empowered Video Large Language Model for Efficient Spatio-Temporal Grounding cs.CV · 2025-12-07 · conditional · none · ref 37 · internal anchor
DEViL offloads spatial grounding to a detector via a distilled reference-semantic token and temporal consistency regularization, reaching 43.1% m_vIoU at 14.33 FPS on HC-STVG.
OmniZip: Audio-Guided Dynamic Token Compression for Fast Omnimodal Large Language Models cs.CV · 2025-11-18 · conditional · none · ref 22 · internal anchor
OmniZip introduces an audio-guided dynamic token compression framework that achieves 3.42X inference speedup and 1.4X memory reduction for omnimodal LLMs without any training.
Cambrian-S: Towards Spatial Supersensing in Video cs.CV · 2025-11-06 · unverdicted · none · ref 69 · internal anchor
Cambrian-S introduces VSI-SUPER benchmarks for long-horizon spatial recall and counting, shows data scaling yields 30% gains on existing tests, and demonstrates a self-supervised next-latent predictor using surprise outperforms baselines on the new spatial supersensing tasks.
Perceive, Verify and Understand Long Video: Multi-Granular Perception and Active Verification via Interactive Agents cs.CV · 2025-09-29 · unverdicted · none · ref 18 · internal anchor
CogniGPT uses an interactive loop between a Multi-Granular Perception Agent and an Active Verification Agent to identify reliable clues in long videos with high accuracy and low frame usage.
InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency cs.CV · 2025-08-25 · unverdicted · none · ref 62 · internal anchor
InternVL3.5 advances open-source multimodal models with Cascade RL for +16% reasoning gains and ViR for 4x inference speedup, with the 241B model reaching SOTA among open-source MLLMs on multimodal, reasoning, and agentic tasks.
GoViG: Goal-Conditioned Visual Navigation Instruction Generation via Multimodal Reasoning cs.CV · 2025-08-13 · unverdicted · none · ref 16 · internal anchor
GoViG decomposes goal-conditioned navigation instruction generation into visual state prediction and instruction synthesis using an autoregressive multimodal LLM with one-pass and interleaved reasoning, showing gains on a new R2R-Goal dataset.
B4DL: A Benchmark for 4D LiDAR LLM in Spatio-Temporal Understanding cs.CV · 2025-08-07 · unverdicted · none · ref 15 · internal anchor
B4DL provides a new benchmark, scalable data generation pipeline, and MLLM architecture for direct spatio-temporal reasoning on raw 4D LiDAR data.
Training-Free Multimodal Large Language Model Orchestration cs.CL · 2025-08-06 · unverdicted · none · ref 26 · 2 links · internal anchor
LLM Orchestration integrates modality experts via an LLM controller, cross-modal memory, and interaction layer to enable multimodal input-output without gradient-based training.
UniMind: Unleashing the Power of LLMs for Unified Multi-Task Brain Decoding cs.HC · 2025-06-23 · unverdicted · none · ref 44 · internal anchor
UniMind unifies multi-task brain decoding from EEG by bridging signals to LLMs via a Neuro-Language Connector and dynamic task queries, outperforming prior models by 12% on average across ten datasets.
