pith. sign in

hub Mixed citations

InternVideo: General Video Foundation Models via Generative and Discriminative Learning

Mixed citation behavior. Most common role is background (60%).

35 Pith papers citing it
Background 60% of classified citations
abstract

The foundation models have recently shown excellent performance on a variety of downstream tasks in computer vision. However, most existing vision foundation models simply focus on image-level pretraining and adpation, which are limited for dynamic and complex video-level understanding tasks. To fill the gap, we present general video foundation models, InternVideo, by taking advantage of both generative and discriminative self-supervised video learning. Specifically, InternVideo efficiently explores masked video modeling and video-language contrastive learning as the pretraining objectives, and selectively coordinates video representations of these two complementary frameworks in a learnable manner to boost various video applications. Without bells and whistles, InternVideo achieves state-of-the-art performance on 39 video datasets from extensive tasks including video action recognition/detection, video-language alignment, and open-world video applications. Especially, our methods can obtain 91.1% and 77.2% top-1 accuracy on the challenging Kinetics-400 and Something-Something V2 benchmarks, respectively. All of these results effectively show the generality of our InternVideo for video understanding. The code will be released at https://github.com/OpenGVLab/InternVideo .

hub tools

citation-role summary

background 3 baseline 1 method 1

citation-polarity summary

fields

cs.CV 35

clear filters

representative citing papers

InstrAct: Towards Action-Centric Understanding in Instructional Videos

cs.CV · 2026-04-09 · unverdicted · novelty 7.0

InstrAction pretrains video foundation models using action-centric data filtering, hard negatives, an Action Perceiver module, DTW-Align, and Masked Action Modeling to reduce static bias and outperform prior models on a new InstrAct Bench for semantic, procedural, and retrieval tasks.

VideoChat: Chat-Centric Video Understanding

cs.CV · 2023-05-10 · conditional · novelty 7.0

VideoChat integrates video models and LLMs via a learnable interface for chat-based spatiotemporal and causal video reasoning, trained on a new video-centric instruction dataset.

OmniGen-AR: AutoRegressive Any-to-Image Generation

cs.CV · 2026-06-08 · unverdicted · novelty 6.0

OmniGen-AR is a unified autoregressive framework for any-to-image generation that tokenizes text and visual conditions together and uses disentangled causal attention to support tasks like text-to-image, depth-to-image, image editing, and text-to-video while reporting 0.63 on GenEval and 80.02 on VB

Streaming Video Instruction Tuning

cs.CV · 2025-12-24 · unverdicted · novelty 6.0

Streamo is a streaming video LLM trained end-to-end on the new Streamo-Instruct-465K dataset that unifies multiple real-time video tasks with claimed strong temporal reasoning and generalization.

Calibrated Multimodal Representation Learning with Missing Modalities

cs.CV · 2025-11-15 · unverdicted · novelty 6.0

CalMRL mitigates anchor shift in multimodal representation learning by calibrating incomplete alignments through representation-level imputation of missing modalities using priors and a bi-step optimization with closed-form shared latent posteriors.

citing papers explorer

Showing 4 of 4 citing papers after filters.

  • Streaming Video Instruction Tuning cs.CV · 2025-12-24 · unverdicted · none · ref 3 · internal anchor

    Streamo is a streaming video LLM trained end-to-end on the new Streamo-Instruct-465K dataset that unifies multiple real-time video tasks with claimed strong temporal reasoning and generalization.

  • Calibrated Multimodal Representation Learning with Missing Modalities cs.CV · 2025-11-15 · unverdicted · none · ref 79 · internal anchor

    CalMRL mitigates anchor shift in multimodal representation learning by calibrating incomplete alignments through representation-level imputation of missing modalities using priors and a bi-step optimization with closed-form shared latent posteriors.

  • Privacy Beyond Pixels: Latent Anonymization for Privacy-Preserving Video Understanding cs.CV · 2025-11-11 · conditional · none · ref 13 · internal anchor

    A plug-and-play Anonymizing Adapter Module removes private information from video latent features using self-supervised privacy objectives and consistency losses while retaining utility on action recognition, temporal detection, and anomaly tasks.

  • TempR1: Improving Temporal Understanding of MLLMs via Temporal-Aware Multi-Task Reinforcement Learning cs.CV · 2025-12-03 · unverdicted · none · ref 56 · internal anchor

    TempR1 applies temporal-aware multi-task RL using GRPO and three types of localization rewards to achieve SOTA temporal understanding in MLLMs with synergistic gains from joint optimization.