Introduces VG-GUIBench benchmark and TASKER keyframe extraction algorithm that improves performance on VideoQA and video-guided agentic tasks.
hub Mixed citations
InternVideo: General Video Foundation Models via Generative and Discriminative Learning
Mixed citation behavior. Most common role is background (60%).
abstract
The foundation models have recently shown excellent performance on a variety of downstream tasks in computer vision. However, most existing vision foundation models simply focus on image-level pretraining and adaptation, which are limited for dynamic and complex video-level understanding tasks. To fill the gap, we present general video foundation models, InternVideo, by taking advantage of both generative and discriminative self-supervised video learning. Specifically, InternVideo efficiently explores masked video modeling and video-language contrastive learning as the pretraining objectives, and selectively coordinates video representations of these two complementary frameworks in a learnable manner to boost various video applications. Without bells and whistles, InternVideo achieves state-of-the-art performance on 39 video datasets from extensive tasks including video action recognition/detection, video-language alignment, and open-world video applications. Especially, our methods can obtain 91.1% and 77.2% top-1 accuracy on the challenging Kinetics-400 and Something-Something V2 benchmarks, respectively. All of these results effectively show the generality of our InternVideo for video understanding. The code will be released at https://github.com/OpenGVLab/InternVideo.
fields
cs.CV 35
representative citing papers
CapRL++ applies reinforcement learning with verifiable rewards to dense image and video captioning by scoring captions via the accuracy of a vision-free LLM answering MCQs from the caption alone.
CausalCine enables real-time causal autoregressive multi-shot video generation via multi-shot training, content-aware memory routing for coherence, and distillation to few-step inference.
TransVLM formalizes Shot Transition Detection as identifying full temporal transition segments rather than single cut points and introduces a VLM that injects optical flow as a motion prior via simple feature fusion, plus a synthetic data engine and benchmark.
TF-SMOT composes pretrained vision-language models into a training-free pipeline that reaches state-of-the-art tracking and improved summary quality on the BenSMOT benchmark.
V-Nutri fuses final-dish features with cooking-process keyframes from egocentric videos to improve dish-level calorie and macronutrient estimation over single-image baselines.
InstrAction pretrains video foundation models using action-centric data filtering, hard negatives, an Action Perceiver module, DTW-Align, and Masked Action Modeling to reduce static bias and outperform prior models on a new InstrAct Bench for semantic, procedural, and retrieval tasks.
Fully end-to-end training with a sentence-conditioned adapter outperforms frozen-backbone baselines for localizing video segments that match sentence queries.
LRM is a large transformer that predicts a NeRF directly from a single image after training on a million-object multi-view dataset.
VideoChat integrates video models and LLMs via a learnable interface for chat-based spatiotemporal and causal video reasoning, trained on a new video-centric instruction dataset.
OmniGen-AR is a unified autoregressive framework for any-to-image generation that tokenizes text and visual conditions together and uses disentangled causal attention to support tasks like text-to-image, depth-to-image, image editing, and text-to-video while reporting 0.63 on GenEval and 80.02 on VB
LyraV uses FDTC and SToP for per-frame incremental decoding to reach 98.29% video synchrony at 3.89 FPS while preserving general understanding.
Reaction-Diffusion Multimodal Fusion (RDMF) applies the Gray-Scott model to video-text alignment for language-guided moment retrieval, claiming better adaptive modeling than static attention.
SlotMemory decomposes transformer KV into discrete semantic slots for entity-level persistence in streaming long-video generation, reporting 81.61 quality and 22.8% dynamic consistency gain on 60-second interactive videos.
Proposes the first unified incomplete video-language model that processes missing modalities and serves as a plug-and-play module to boost existing VLMs on multi-modal tasks.
XComp reaches extreme video compression (one token per selective frame) via learnable progressive token compression and question-conditioned frame selection, lifting LVBench accuracy from 42.9 percent to 46.2 percent after tuning on 2.5 percent of standard data.
FreqFormer applies heterogeneous attention (dense global on low frequencies, block-sparse on mid, local on high) plus adaptive spectral routing to reduce attention cost in long-sequence video diffusion transformers.
UniversalVTG is a lightweight foundation model for video temporal grounding that achieves state-of-the-art results across five benchmarks while being over 100 times smaller than recent MLLM-based methods.
Streamo is a streaming video LLM trained end-to-end on the new Streamo-Instruct-465K dataset that unifies multiple real-time video tasks with claimed strong temporal reasoning and generalization.
CalMRL mitigates anchor shift in multimodal representation learning by calibrating incomplete alignments through representation-level imputation of missing modalities using priors and a bi-step optimization with closed-form shared latent posteriors.
A plug-and-play Anonymizing Adapter Module removes private information from video latent features using self-supervised privacy objectives and consistency losses while retaining utility on action recognition, temporal detection, and anomaly tasks.
VideoChat-Flash applies hierarchical video token compression to achieve ~50x reduction in context length for long videos while maintaining near-original performance on long-context benchmarks.
V-JEPA models trained only on feature prediction from 2 million public videos achieve 81.9% on Kinetics-400, 72.2% on Something-Something-v2, and 77.9% on ImageNet-1K using frozen ViT-H/16 backbones.
MVBench is a benchmark of 20 temporal video understanding tasks built by transforming static tasks into dynamic ones, with VideoChat2 outperforming prior MLLMs by over 15%.
citing papers explorer
- Streaming Video Instruction Tuning
Streamo is a streaming video LLM trained end-to-end on the new Streamo-Instruct-465K dataset that unifies multiple real-time video tasks with claimed strong temporal reasoning and generalization.
- Calibrated Multimodal Representation Learning with Missing Modalities
CalMRL mitigates anchor shift in multimodal representation learning by calibrating incomplete alignments through representation-level imputation of missing modalities using priors and a bi-step optimization with closed-form shared latent posteriors.
- Privacy Beyond Pixels: Latent Anonymization for Privacy-Preserving Video Understanding
A plug-and-play Anonymizing Adapter Module removes private information from video latent features using self-supervised privacy objectives and consistency losses while retaining utility on action recognition, temporal detection, and anomaly tasks.
- TempR1: Improving Temporal Understanding of MLLMs via Temporal-Aware Multi-Task Reinforcement Learning
TempR1 applies temporal-aware multi-task RL using GRPO and three types of localization rewards to achieve SOTA temporal understanding in MLLMs with synergistic gains from joint optimization.