InternVideo: General Video Foundation Models via Generative and Discriminative Learning

Yi Wang, Kunchang Li, Yizhuo Li, Yinan He, Bingkun Huang, Zhiyu Zhao · 2022 · cs.CV · arXiv 2212.03191

Mixed citation behavior. Most common role is background (60%).

41 Pith papers citing it

abstract

The foundation models have recently shown excellent performance on a variety of downstream tasks in computer vision. However, most existing vision foundation models simply focus on image-level pretraining and adaptation, which are limited for dynamic and complex video-level understanding tasks. To fill the gap, we present general video foundation models, InternVideo, by taking advantage of both generative and discriminative self-supervised video learning. Specifically, InternVideo efficiently explores masked video modeling and video-language contrastive learning as the pretraining objectives, and selectively coordinates video representations of these two complementary frameworks in a learnable manner to boost various video applications. Without bells and whistles, InternVideo achieves state-of-the-art performance on 39 video datasets from extensive tasks including video action recognition/detection, video-language alignment, and open-world video applications. Especially, our methods can obtain 91.1% and 77.2% top-1 accuracy on the challenging Kinetics-400 and Something-Something V2 benchmarks, respectively. All of these results effectively show the generality of our InternVideo for video understanding. The code will be released at https://github.com/OpenGVLab/InternVideo.

citation-role summary

background 3 · baseline 1 · method 1

citation-polarity summary

background 3 · baseline 1 · use method 1

representative citing papers

Bridging VideoQA and Video-Guided Agentic Tasks via Generalized Keyframe Extraction

cs.CV · 2026-06-28 · unverdicted · novelty 7.0

Introduces VG-GUIBench benchmark and TASKER keyframe extraction algorithm that improves performance on VideoQA and video-guided agentic tasks.

CineOrchestra: Unified Entity-Centric Conditioning for Cinematic Video Generation

cs.CV · 2026-06-11 · unverdicted · novelty 7.0

CineOrchestra unifies control of subjects, events, cameras, and shot transitions in cinematic video generation through entity-centric conditioning primitives and parameter-free coordinated rotary embeddings.

CapRL++: Unified Reinforcement Learning with Verifiable Rewards for Dense Image and Video Captioning

cs.CV · 2026-06-08 · unverdicted · novelty 7.0

CapRL++ applies reinforcement learning with verifiable rewards to dense image and video captioning by scoring captions via the accuracy of a vision-free LLM answering MCQs from the caption alone.

CausalCine: Real-Time Autoregressive Generation for Multi-Shot Video Narratives

cs.CV · 2026-05-12 · unverdicted · novelty 7.0

CausalCine enables real-time causal autoregressive multi-shot video generation via multi-shot training, content-aware memory routing for coherence, and distillation to few-step inference.

TransVLM: A Vision-Language Framework and Benchmark for Detecting Any Shot Transitions

cs.CV · 2026-04-30 · unverdicted · novelty 7.0

TransVLM formalizes Shot Transition Detection as identifying full temporal transition segments rather than single cut points and introduces a VLM that injects optical flow as a motion prior via simple feature fusion, plus a synthetic data engine and benchmark.

Training-Free Semantic Multi-Object Tracking with Vision-Language Models

cs.CV · 2026-04-15 · conditional · novelty 7.0

TF-SMOT composes pretrained vision-language models into a training-free pipeline that reaches state-of-the-art tracking and improved summary quality on the BenSMOT benchmark.

V-Nutri: Dish-Level Nutrition Estimation from Egocentric Cooking Videos

cs.CV · 2026-04-13 · unverdicted · novelty 7.0

V-Nutri fuses final-dish features with cooking-process keyframes from egocentric videos to improve dish-level calorie and macronutrient estimation over single-image baselines.

InstrAct: Towards Action-Centric Understanding in Instructional Videos

cs.CV · 2026-04-09 · unverdicted · novelty 7.0

InstrAction pretrains video foundation models using action-centric data filtering, hard negatives, an Action Perceiver module, DTW-Align, and Masked Action Modeling to reduce static bias and outperform prior models on a new InstrAct Bench for semantic, procedural, and retrieval tasks.

A Paradigm Shift: Fully End-to-End Training for Temporal Sentence Grounding in Videos

cs.CV · 2026-04-03 · unverdicted · novelty 7.0

Fully end-to-end training with a sentence-conditioned adapter outperforms frozen-backbone baselines for localizing video segments that match sentence queries.

LRM: Large Reconstruction Model for Single Image to 3D

cs.CV · 2023-11-08 · conditional · novelty 7.0

LRM is a large transformer that predicts a NeRF directly from a single image after training on a million-object multi-view dataset.

VideoChat: Chat-Centric Video Understanding

cs.CV · 2023-05-10 · conditional · novelty 7.0

VideoChat integrates video models and LLMs via a learnable interface for chat-based spatiotemporal and causal video reasoning, trained on a new video-centric instruction dataset.

T-MOR: Learning Motion-Aware Skeleton Representations for Human Action Recognition

cs.CV · 2026-06-19 · unverdicted · novelty 6.0

T-MOR is a multi-modal contrastive framework that pre-trains transferable skeleton motion representations using a new 1M video-skeleton-text dataset and shows gains on action classification and temporal detection benchmarks plus few/zero-shot settings.

ELVA: Exploring Ranking-Driven Universal Multimodal Retrieval

cs.IR · 2026-06-18 · unverdicted · novelty 6.0

ELVA applies ranking-driven RLVR to multimodal retrieval to reduce grain blindness in contrastive learning, reporting SOTA results and a 13.1% gain on the new MRBench benchmark.

HYDRA-X: Native Unified Multimodal Models with Holistic Visual Tokenizers

cs.CV · 2026-06-11 · unverdicted · novelty 6.0

HYDRA-X presents the first unified multimodal model using a single ViT for holistic image-video tokenization, with ablations on attention and compression plus a latent-level editing improvement.

OmniGen-AR: AutoRegressive Any-to-Image Generation

cs.CV · 2026-06-08 · unverdicted · novelty 6.0

OmniGen-AR is a unified autoregressive framework for any-to-image generation that tokenizes text and visual conditions together and uses disentangled causal attention to support tasks like text-to-image, depth-to-image, image editing, and text-to-video while reporting 0.63 on GenEval and 80.02 on VB

Don't Pause: Streaming Video-Language Synchrony for Online Video Understanding

cs.CV · 2026-06-05 · unverdicted · novelty 6.0

LyraV uses FDTC and SToP for per-frame incremental decoding to reach 98.29% video synchrony at 3.89 FPS while preserving general understanding.

Turing Patterns for Multimedia: Reaction-Diffusion Multi-Modal Fusion for Language-Guided Video Moment Retrieval

cs.CV · 2026-06-01 · unverdicted · novelty 6.0

Reaction-Diffusion Multimodal Fusion (RDMF) applies the Gray-Scott model to video-text alignment for language-guided moment retrieval, claiming better adaptive modeling than static attention.

SlotMemory: Object-Centric KV Memory for Streaming Long-Video Generation

cs.CV · 2026-05-29 · unverdicted · novelty 6.0

SlotMemory decomposes transformer KV into discrete semantic slots for entity-level persistence in streaming long-video generation, reporting 81.61 quality and 22.8% dynamic consistency gain on 60-second interactive videos.

Towards Unified Vision-Language Models with Incomplete Multi-Modal Inputs

cs.CV · 2026-05-27 · unverdicted · novelty 6.0

Proposes the first unified incomplete video-language model that processes missing modalities and serves as a plug-and-play module to boost existing VLMs on multi-modal tasks.

One Token per Highly Selective Frame: Towards Extreme Compression for Long Video Understanding

cs.CV · 2026-04-15 · unverdicted · novelty 6.0

XComp reaches extreme video compression (one token per selective frame) via learnable progressive token compression and question-conditioned frame selection, lifting LVBench accuracy from 42.9 percent to 46.2 percent after tuning on 2.5 percent of standard data.

FreqFormer: Hierarchical Frequency-Domain Attention with Adaptive Spectral Routing for Long-Sequence Video Diffusion Transformers

cs.CV · 2026-04-14 · unverdicted · novelty 6.0

FreqFormer applies heterogeneous attention (dense global on low frequencies, block-sparse on mid, local on high) plus adaptive spectral routing to reduce attention cost in long-sequence video diffusion transformers.

UniversalVTG: A Universal and Lightweight Foundation Model for Video Temporal Grounding

cs.CV · 2026-04-09 · unverdicted · novelty 6.0

UniversalVTG is a lightweight foundation model for video temporal grounding that achieves state-of-the-art results across five benchmarks while being over 100 times smaller than recent MLLM-based methods.

Streaming Video Instruction Tuning

cs.CV · 2025-12-24 · unverdicted · novelty 6.0

Streamo is a streaming video LLM trained end-to-end on the new Streamo-Instruct-465K dataset that unifies multiple real-time video tasks with claimed strong temporal reasoning and generalization.

Calibrated Multimodal Representation Learning with Missing Modalities

cs.CV · 2025-11-15 · unverdicted · novelty 6.0

CalMRL mitigates anchor shift in multimodal representation learning by calibrating incomplete alignments through representation-level imputation of missing modalities using priors and a bi-step optimization with closed-form shared latent posteriors.
