hub Mixed citations

InternVideo: General Video Foundation Models via Generative and Discriminative Learning

Yi Wang, Kunchang Li, Yizhuo Li, Yinan He, Bingkun Huang, Zhiyu Zhao · 2022 · cs.CV · arXiv 2212.03191

Mixed citation behavior. Most common role is background (60%).

27 Pith papers citing it

Background 60% of classified citations

abstract

The foundation models have recently shown excellent performance on a variety of downstream tasks in computer vision. However, most existing vision foundation models simply focus on image-level pretraining and adaptation, which are limited for dynamic and complex video-level understanding tasks. To fill the gap, we present general video foundation models, InternVideo, by taking advantage of both generative and discriminative self-supervised video learning. Specifically, InternVideo efficiently explores masked video modeling and video-language contrastive learning as the pretraining objectives, and selectively coordinates video representations of these two complementary frameworks in a learnable manner to boost various video applications. Without bells and whistles, InternVideo achieves state-of-the-art performance on 39 video datasets from extensive tasks including video action recognition/detection, video-language alignment, and open-world video applications. Especially, our methods can obtain 91.1% and 77.2% top-1 accuracy on the challenging Kinetics-400 and Something-Something V2 benchmarks, respectively. All of these results effectively show the generality of our InternVideo for video understanding. The code will be released at https://github.com/OpenGVLab/InternVideo .

citation-role summary

background 3 baseline 1 method 1
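
The 60% background share quoted above appears consistent with the role counts in this summary (3 of 5 classified citations). A minimal sketch of that arithmetic follows, assuming the percentage is taken over classified citations only rather than over all 27 citing papers; the function name `role_shares` is hypothetical and not part of any Pith tooling.

```python
def role_shares(counts):
    """Return each role's share of classified citations as a whole percent."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {role: round(100 * n / total) for role, n in counts.items()}


# Role counts as listed in the citation-role summary.
counts = {"background": 3, "baseline": 1, "method": 1}

shares = role_shares(counts)
top_role = max(shares, key=shares.get)  # most common role

print(shares)
print(top_role)
```

Under this assumption, background accounts for 3 / (3 + 1 + 1) = 60% of classified citations, with baseline and method at 20% each.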

citation-polarity summary

background 3 baseline 1 use method 1

representative citing papers

CausalCine: Real-Time Autoregressive Generation for Multi-Shot Video Narratives

cs.CV · 2026-05-12 · unverdicted · novelty 7.0

CausalCine enables real-time causal autoregressive multi-shot video generation via multi-shot training, content-aware memory routing for coherence, and distillation to few-step inference.

TransVLM: A Vision-Language Framework and Benchmark for Detecting Any Shot Transitions

cs.CV · 2026-04-30 · unverdicted · novelty 7.0

TransVLM formalizes Shot Transition Detection as identifying full temporal transition segments rather than single cut points and introduces a VLM that injects optical flow as a motion prior via simple feature fusion, plus a synthetic data engine and benchmark.

Training-Free Semantic Multi-Object Tracking with Vision-Language Models

cs.CV · 2026-04-15 · conditional · novelty 7.0

TF-SMOT composes pretrained vision-language models into a training-free pipeline that reaches state-of-the-art tracking and improved summary quality on the BenSMOT benchmark.

V-Nutri: Dish-Level Nutrition Estimation from Egocentric Cooking Videos

cs.CV · 2026-04-13 · unverdicted · novelty 7.0

V-Nutri fuses final-dish features with cooking-process keyframes from egocentric videos to improve dish-level calorie and macronutrient estimation over single-image baselines.

InstrAct: Towards Action-Centric Understanding in Instructional Videos

cs.CV · 2026-04-09 · unverdicted · novelty 7.0

InstrAction pretrains video foundation models using action-centric data filtering, hard negatives, an Action Perceiver module, DTW-Align, and Masked Action Modeling to reduce static bias and outperform prior models on a new InstrAct Bench for semantic, procedural, and retrieval tasks.

A Paradigm Shift: Fully End-to-End Training for Temporal Sentence Grounding in Videos

cs.CV · 2026-04-03 · unverdicted · novelty 7.0

Fully end-to-end training with a sentence-conditioned adapter outperforms frozen-backbone baselines for localizing video segments that match sentence queries.

LRM: Large Reconstruction Model for Single Image to 3D

cs.CV · 2023-11-08 · conditional · novelty 7.0

LRM is a large transformer that predicts a NeRF directly from a single image after training on a million-object multi-view dataset.

VideoChat: Chat-Centric Video Understanding

cs.CV · 2023-05-10 · conditional · novelty 7.0

VideoChat integrates video models and LLMs via a learnable interface for chat-based spatiotemporal and causal video reasoning, trained on a new video-centric instruction dataset.

Towards Unified Vision-Language Models with Incomplete Multi-Modal Inputs

cs.CV · 2026-05-27 · unverdicted · novelty 6.0

Proposes the first unified incomplete video-language model that processes missing modalities and serves as a plug-and-play module to boost existing VLMs on multi-modal tasks.

One Token per Highly Selective Frame: Towards Extreme Compression for Long Video Understanding

cs.CV · 2026-04-15 · unverdicted · novelty 6.0

XComp reaches extreme video compression (one token per selective frame) via learnable progressive token compression and question-conditioned frame selection, lifting LVBench accuracy from 42.9 percent to 46.2 percent after tuning on 2.5 percent of standard data.

FreqFormer: Hierarchical Frequency-Domain Attention with Adaptive Spectral Routing for Long-Sequence Video Diffusion Transformers

cs.CV · 2026-04-14 · unverdicted · novelty 6.0

FreqFormer applies heterogeneous attention (dense global on low frequencies, block-sparse on mid, local on high) plus adaptive spectral routing to reduce attention cost in long-sequence video diffusion transformers.

UniversalVTG: A Universal and Lightweight Foundation Model for Video Temporal Grounding

cs.CV · 2026-04-09 · unverdicted · novelty 6.0

UniversalVTG is a lightweight foundation model for video temporal grounding that achieves state-of-the-art results across five benchmarks while being over 100 times smaller than recent MLLM-based methods.

Streaming Video Instruction Tuning

cs.CV · 2025-12-24 · unverdicted · novelty 6.0

Streamo is a streaming video LLM trained end-to-end on the new Streamo-Instruct-465K dataset that unifies multiple real-time video tasks with claimed strong temporal reasoning and generalization.

Calibrated Multimodal Representation Learning with Missing Modalities

cs.CV · 2025-11-15 · unverdicted · novelty 6.0

CalMRL mitigates anchor shift in multimodal representation learning by calibrating incomplete alignments through representation-level imputation of missing modalities using priors and a bi-step optimization with closed-form shared latent posteriors.

Privacy Beyond Pixels: Latent Anonymization for Privacy-Preserving Video Understanding

cs.CV · 2025-11-11 · conditional · novelty 6.0

A plug-and-play Anonymizing Adapter Module removes private information from video latent features using self-supervised privacy objectives and consistency losses while retaining utility on action recognition, temporal detection, and anomaly tasks.

VideoChat-Flash: Hierarchical Compression for Long-Context Video Modeling

cs.CV · 2024-12-31 · unverdicted · novelty 6.0

VideoChat-Flash applies hierarchical video token compression to achieve ~50x reduction in context length for long videos while maintaining near-original performance on long-context benchmarks.

Revisiting Feature Prediction for Learning Visual Representations from Video

cs.CV · 2024-02-15 · conditional · novelty 6.0

V-JEPA models trained only on feature prediction from 2 million public videos achieve 81.9% on Kinetics-400, 72.2% on Something-Something-v2, and 77.9% on ImageNet-1K using frozen ViT-H/16 backbones.

MVBench: A Comprehensive Multi-modal Video Understanding Benchmark

cs.CV · 2023-11-28 · accept · novelty 6.0

MVBench is a benchmark of 20 temporal video understanding tasks built by transforming static tasks into dynamic ones, with VideoChat2 outperforming prior MLLMs by over 15%.

LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment

cs.CV · 2023-10-03 · unverdicted · novelty 6.0

LanguageBind aligns video, infrared, depth, and audio to a frozen language encoder via contrastive learning on the new VIDAL-10M dataset, extending video-language pretraining to N modalities.

InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation

cs.CV · 2023-07-13 · unverdicted · novelty 6.0

InternVid supplies 7M videos and LLM captions to train ViCLIP, which reaches leading zero-shot action recognition and competitive retrieval performance.

TuringViT: Making SOTA Vision Transformers Accessible to All

cs.CV · 2026-06-23 · unverdicted · novelty 5.0

TuringViT claims a new ViT design with linear attention and curated data that matches SOTA performance using 10% of typical pretraining data while supporting dynamic resolutions and improving VLM integration.

Masked Diffusion Vision-Language Models for Temporal Action Localization

cs.CV · 2026-05-28 · unverdicted · novelty 5.0

Adapts masked diffusion vision-language models (MDVLMs) to temporal action localization (TAL) via a planned training objective and a step-level IoU reward, reporting gains over autoregressive baselines on the ActivityNet and THUMOS datasets.

LoViF 2026 The First Challenge on Holistic Quality Assessment for 4D World Model (PhyScore)

cs.CV · 2026-05-06 · conditional · novelty 5.0

The PhyScore challenge creates the first benchmark requiring metrics to jointly score video quality, physical realism, condition alignment, and temporal consistency while localizing physical anomalies in 1554 videos from seven generative models across text-to-2D, image-to-4D, and video-to-4D tracks.

Efficient Spatial-Temporal Focal Adapter with SSM for Temporal Action Detection

cs.CV · 2026-04-10 · unverdicted · novelty 5.0

A new adapter module combining boundary-aware state space modeling with spatial processing boosts localization and robustness in temporal action detection.

citing papers explorer

Showing 1 of 1 citing paper after filters.

InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks

cs.CV · 2023-12-21 · unverdicted · none · ref 152 · internal anchor

InternVL scales a vision model to 6B parameters and aligns it with LLMs using web data to achieve state-of-the-art results on 32 visual-linguistic benchmarks.
