Video (language) modeling: a baseline for generative models of natural videos
read the original abstract
We propose a strong baseline model for unsupervised feature learning using video data. By learning to predict missing frames or extrapolate future frames from an input video sequence, the model discovers both spatial and temporal correlations which are useful to represent complex deformations and motion patterns. The models we propose are largely borrowed from the language modeling literature, and adapted to the vision domain by quantizing the space of image patches into a large dictionary. We demonstrate the approach on both a filling and a generation task. For the first time, we show that, after training on natural videos, such a model can predict non-trivial motions over short video sequences.
This paper has not been read by Pith yet.
Forward citations
Cited by 6 Pith papers
-
Phenaki: Variable Length Video Generation From Open Domain Textual Description
Phenaki generates arbitrary-length videos from sequences of text prompts by tokenizing videos with causal temporal attention and generating tokens with a text-conditioned masked transformer, trained jointly on images ...
-
Imagen Video: High Definition Video Generation with Diffusion Models
Imagen Video generates high-definition text-conditional videos via a cascade of base and super-resolution diffusion models, achieving high fidelity and controllability.
-
Video Generators are Robot Policies
Training models to generate videos of robot actions produces policies that generalize better to new objects and tasks while using far less demonstration data than standard behavior cloning.
-
Frozen Forecasting: A Unified Evaluation
A new evaluation framework using latent diffusion on frozen vision backbones shows video-pretrained models consistently outperform image-based ones in forecasting entire trajectories across abstraction levels.
-
MagicVideo: Efficient Video Generation With Latent Diffusion Models
MagicVideo generates 256x256 text-conditioned video clips via latent diffusion with a custom 3D U-Net, achieving roughly 64 times lower compute than prior video diffusion models.
-
Multi-Modal World Model for Physical Robot Interactions: Simultaneous Visual and Tactile Predictions for Enhanced Accuracy
Visuo-tactile world models improve prediction accuracy in physically ambiguous robot-pushing scenarios, demonstrated on two new datasets with a magnetic tactile sensor.
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.