Introduces the VidPair-Halluc benchmark of 1K background-controlled adversarial video pairs and 11K QA pairs, generated via the PairFlow pipeline, to evaluate hallucination in LVMs.
hub Mixed citations
VideoChat: Chat-Centric Video Understanding
Mixed citation behavior. Most common role is background (64%).
abstract
In this paper, we initiate an attempt of developing an end-to-end chat-centric video understanding system, coined as VideoChat. It integrates video foundation models and large language models via a learnable neural interface, excelling in spatiotemporal reasoning, event localization, and causal relationship inference. To instructively tune this system, we build a video-centric instruction dataset, composed of thousands of videos associated with detailed descriptions and conversations. This dataset emphasizes spatiotemporal reasoning and captures causal relationships, providing a valuable asset for training our chat-centric video understanding system. Preliminary qualitative experiments demonstrate the potential of our system across a broad spectrum of video applications, which could serve as a simple prototype system for future research on chat-centric video understanding. Access our code and data at https://github.com/OpenGVLab/Ask-Anything
claims ledger
- baseline 2 Related Work Multimodal Large Language Models. With the impressive success of large language models (LLMs) [1, 5, 4], recent studies work on generative Multimodal Large Language Models (MLLMs) [6, 7, 8, 9, 10, 11, 12, 13, 14, 18, 19, 20, 21] to improve multimodal comprehension and generation by utilizing the strong generality of LLMs. Some work [15, 16, 17] further considers video inputs and leverages the vast capabilities of LLMs for video understanding tasks. In SEED-Bench, we provide a
- baseline Columns: method | #F | Video → Text R@1, R@5, R@10 | Text → Video R@1, R@5, R@10 | avg. OpenAI CLIP-L [117] | 1 | 27.8, 49.4, 58.0 | 29.0, 50.5, 59.2 | 45.7; InternVL-C (ours) | 1 | 35.3, 56.6, 66.6 | 37.5, 60.9, 70.9 | 54.6; InternVL-G (ours) | 1 | 36.6, 58.3, 67.7 | 39.1, 61.7, 70.7 | 55.7; OpenAI CLIP-L [117] | 8 | 26.6, 50.8, 61.8 | 30.7, 54.4, 64.0 | 48.1; Florence [171] | 8 | -, -, - | 37.6, 63.8, 72.6 | -; InternVideo† [151] | 8 | 39.6, -, - | 40.7, -, - | -; UMT-L† [83] | 8 | 38.6, 59.8, 69.6 | 42.6, 64.4, 73.1 | 58.0; LanguageBind† [186] | 8 | 40.9, 66.4, 75.7 | 44.8, 70.0, 78.7 | 62.8; InternVL-C (ours) | 8 | 40.
- background MLLMs-based video translation. By structuring our survey in this role-oriented manner (see Fig. 1), we aim to provide conceptual clarity and facilitate comparative analysis. The arXiv:2604.11283v1 [cs.CV] 13 Apr 2026 2 Taxonomy The SemanticReasoner Video-Language Alignment MiniGPT4-Video [1], FrozenBiLM [2], Video-ChatGPT [3], Video-LLaMA [4], VideoChat [5], LLaMA-VID [6], Valley [7], Vista-LLaMA [8], IG-VLM [9], VideoChat2 [10], VaQuitA [11], Vamos [12], COSMO [13], IVA [14], MMICT [15], LXMERT
- baseline 4 / 55.0 61.0 1.39 56.6 53.0 54.1 Qwen2-VL-2B [138] 55.6 / 60.4 63.2 - - - - Qwen2.5-VL-3B [5] 61.5 / 67.6 67.0 1.63 68.2 43.3 60.3 InternVL3-2B [187] 58.9 / 61.4 70.4 1.42 64.2 55.4 59.6 InternVL3.5-2B 58.4 / 61.9 65.9 1.56 64.4 57.4 60.0 MiniCPM-V-4-4B [164] 61.2 / 65.8 58.7 - - - - InternVL3.5-4B 65.4 / 68.6 71.2 1.59 70.4 60.8 64.9 VideoChat2-HD [62] 45.3 / 55.7 62.3 1.22 47.9 - - LLaVA-OneVision-7B [58] 58.2 / - 56.7 - - - - MiniCPM-V-2.6 [164] 60.9 / 63.6 - 1.70 - 54.9 - Qwen2-VL-7B [138]
- dataset Img-Diff (en) [101], Birds-to-Words (en) [100], Spot-the-Diff (en) [100], MultiVQA (en) [100], NLVR2 (en) [216], General QA ContrastiveCaption (en) [100], DreamSim (en) [100], InternVL-SA-1B-Caption (en & zh) [36] Document MP-DocVQA (en) [233], MP-Docmatix (en) [121] Type: Video Datasets Vript (en & zh) [269], OpenVid (en) [190], Mementos (en) [254], ShareGPT4o-Video (en & zh) [35], Captioning ShareGPT4Video (en & zh) [30], VideoGPT+ (en) [174] VideoChat2-IT (en & zh) [130, 131], EgoTaskQA (en) [9
- background improved long video understanding accuracy. 2 Related Work Vision-Language Models for Long Sequence Understanding. Early Vision-Language Models (VLMs), such as GPT-4V and Gemini-1.5 [49, 58], showcased powerful multimodal reasoning by integrating visual encoders with large language models. Open-source efforts like Llama-Vid [36], IDEFICS [24], VideoChat [34], Video-LLaMA [12], and others [2, 32, 35, 38, 44, 61, 62] have further advanced capabilities, often matching or exceeding proprietary system
representative citing papers
NEST is a new benchmark dataset for narrative event structures in long videos, with baselines reporting ETD below 8%, EL under 6%, EAE below 11%, and ERE at 35-44% F1.
RNG-Bench evaluates MLLMs on hidden-observation reconstruction in non-Markov games, finds forgetting as the dominant error source, and shows fine-tuning on optimal rollouts improves performance with transfer to other benchmarks.
Future-L1 interleaves latent visual spans with text in MLLM decoding, trained on a custom Future-L1-50K dataset via LA-DAPO RL, and reports SOTA gains on FutureBench (61.0 to 85.4) and TwiFF-Bench (2.44 to 3.04).
SelfBootTok decomposes image tokens into global and local groups via self-bootstrapped learning, enabling generators to use only global tokens for ~40% less computation and a new SOTA gFID of 1.56 with 64 tokens.
EvoCut is a training-free visual token compression technique that identifies important tokens via multi-layer evolution deviation, retaining 11.1% tokens with 94.4% average performance preserved on LLaVA-1.5-7B.
CaST-Bench creates a benchmark with causal-chain annotations and novel metrics showing that current VLMs struggle to construct precise grounded causal chains in video QA.
AffectVerse improves multimodal emotion recognition by at least 2.57% on nine benchmarks through an Emotion World Module that performs short-horizon latent affective prediction via cross-modal temporal imagination and belief aggregation.
A proposer-solver agent pair achieves supervised-level video temporal grounding and fine-grained captioning from 2.5K unlabeled videos via self-reinforcing evolution.
Defines OZ-TAL task and presents a training-free VLM-based method that outperforms prior approaches for online and offline zero-shot temporal action localization on THUMOS14 and ActivityNet-1.3.
A contract-based multi-agent system maintains a claim-level semantic memory for long videos, enabling targeted corrections that raise VQA accuracy from 0.71 to 0.79 and cut human arbitration cost by 4.8x on VidOR.
OASIS organizes streaming video into hierarchical events and retrieves memory on-demand via intent-driven refinement to improve long-horizon accuracy and compositional reasoning with bounded token costs.
SVAgent improves long video question answering by constructing storylines via multi-agent collaboration and aligning cross-modal predictions for more robust, human-like reasoning.
VideoThinker uses LLM-generated synthetic tool trajectories in caption space grounded to video frames to train agentic VideoLLMs that outperform baselines on long-video benchmarks.
Introduces the first dedicated benchmark for live multi-modal LLM task guidance with mistake detection and a streaming baseline model.
EMCompress introduces EMC as an endomorphic sufficient-statistic transformation for VideoQA that preserves answer invariance, releases a dedicated benchmark, and reports a ReSimplifyIt baseline with 0.40 F-1 gains plus efficiency improvements.
Video-Holmes benchmark shows top MLLMs achieve at most 45% accuracy on tasks needing integration of multiple clues from suspense films, unlike existing perception-focused tests.
HumanVBench provides a 16-task benchmark for human-centric video understanding in MLLMs, created through automated annotation and distractor synthesis pipelines, and shows top models lag human performance on emotion perception and cross-modal alignment.
VidHal is a new benchmark that evaluates VLLM temporal hallucinations through a caption ordering task on videos with varying hallucination levels.
LLaVA-NeXT-Interleave unifies multi-image, video, and 3D capabilities in large multimodal models via a new 1.18M-sample interleaved dataset and benchmark, achieving leading results across those tasks while preserving single-image performance.
MLVU is a new benchmark for long video understanding that uses extended videos across diverse genres and multi-task evaluations, revealing that current MLLMs struggle significantly and degrade sharply with longer durations.
SEED-Bench is a new benchmark of 19K multiple-choice questions for evaluating generative comprehension in multimodal LLMs across 12 image and video dimensions.
VisReflect generates continuous latent visual reflections to emphasize relevant visual features and guide attention in LVLMs, yielding 4.1% gains on image benchmarks and 1.8% on video benchmarks with 44% less inference time than zooming methods.
HPP decouples perception from reasoning in long-video VLMs by having an LLM run iterative programmatic probes on hierarchically segmented video, reporting gains on LongVideoBench, EgoSchema, VideoMME, and MLVU.
citing papers explorer
- Balancing Image Compression and Generation with Bootstrapped Tokenization