Video (language) modeling: a baseline for generative models of natural videos

Arthur Szlam; Joan Bruna; MarcAurelio Ranzato; Michael Mathieu; Ronan Collobert; Sumit Chopra

arxiv: 1412.6604 · v5 · pith:LOAS77FFnew · submitted 2014-12-20 · 💻 cs.LG · cs.CV

MarcAurelio Ranzato, Arthur Szlam, Joan Bruna, Michael Mathieu, Ronan Collobert, Sumit Chopra

classification 💻 cs.LG cs.CV

keywords video · model · baseline · frames · language · learning · modeling · models

We propose a strong baseline model for unsupervised feature learning using video data. By learning to predict missing frames or extrapolate future frames from an input video sequence, the model discovers both spatial and temporal correlations which are useful to represent complex deformations and motion patterns. The models we propose are largely borrowed from the language modeling literature, and adapted to the vision domain by quantizing the space of image patches into a large dictionary. We demonstrate the approach on both a filling and a generation task. For the first time, we show that, after training on natural videos, such a model can predict non-trivial motions over short video sequences.
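
The idea in the abstract (quantize image patches into a dictionary, then treat the resulting indices like words in a language model) can be illustrated with a toy sketch. The code below is a simplification under stated assumptions, not the authors' implementation: it assumes plain k-means for the dictionary and a count-based bigram model for predicting the next patch at one spatial location, and every function name (`quantize`, `build_dictionary`, `fit_bigram`, `predict_next`) is hypothetical.

```python
import numpy as np


def quantize(patches, centroids):
    """Map each flattened patch to the index of its nearest centroid (its 'word')."""
    d2 = ((patches[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)


def build_dictionary(patches, k, iters=10):
    """Plain k-means over flattened patches; returns a (k, d) array of centroids.

    Assumption: centroids are initialised from evenly spaced samples, which
    keeps the sketch deterministic. A real system would likely use a better
    initialisation and a far larger dictionary.
    """
    init_idx = np.linspace(0, len(patches) - 1, k).astype(int)
    centroids = patches[init_idx].astype(float)
    for _ in range(iters):
        labels = quantize(patches, centroids)
        for j in range(k):
            members = patches[labels == j]
            if len(members) > 0:  # leave empty clusters where they are
                centroids[j] = members.mean(axis=0)
    return centroids


def fit_bigram(tokens, k):
    """Count-based bigram model over a token sequence, with add-one smoothing."""
    counts = np.ones((k, k))
    for prev, nxt in zip(tokens[:-1], tokens[1:]):
        counts[prev, nxt] += 1
    return counts / counts.sum(axis=1, keepdims=True)


def predict_next(token, transitions):
    """Return the most likely next token given the current one."""
    return int(transitions[token].argmax())


if __name__ == "__main__":
    # Toy "video": one 2x2 patch location that alternates dark/bright over time.
    rng = np.random.default_rng(0)
    base = (np.arange(40) % 2).astype(float)[:, None] * np.ones((1, 4))
    patches = base + 0.01 * rng.standard_normal((40, 4))

    centroids = build_dictionary(patches, k=2)
    tokens = quantize(patches, centroids)
    transitions = fit_bigram(tokens, k=2)
    print("next token after", tokens[-1], "is", predict_next(tokens[-1], transitions))
```

Once patches are discrete tokens, next-frame prediction becomes a classification problem over dictionary entries, so the bigram table here could in principle be replaced by any sequence model.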

Forward citations

Cited by 6 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Phenaki: Variable Length Video Generation From Open Domain Textual Description
cs.CV · 2022-10 · unverdicted · novelty 7.0

Phenaki generates arbitrary-length videos from sequences of text prompts by tokenizing videos with causal temporal attention and generating tokens with a text-conditioned masked transformer, trained jointly on images ...

Imagen Video: High Definition Video Generation with Diffusion Models
cs.CV · 2022-10 · unverdicted · novelty 7.0

Imagen Video generates high-definition text-conditional videos via a cascade of base and super-resolution diffusion models, achieving high fidelity and controllability.

Video Generators are Robot Policies
cs.RO · 2025-08 · conditional · novelty 6.0

Training models to generate videos of robot actions produces policies that generalize better to new objects and tasks while using far less demonstration data than standard behavior cloning.

Frozen Forecasting: A Unified Evaluation
cs.CV · 2025-07 · unverdicted · novelty 6.0

A new evaluation framework using latent diffusion on frozen vision backbones shows video-pretrained models consistently outperform image-based ones in forecasting entire trajectories across abstraction levels.

MagicVideo: Efficient Video Generation With Latent Diffusion Models
cs.CV · 2022-11 · unverdicted · novelty 6.0

MagicVideo generates 256x256 text-conditioned video clips via latent diffusion with a custom 3D U-Net, achieving roughly 64 times lower compute than prior video diffusion models.

Multi-Modal World Model for Physical Robot Interactions: Simultaneous Visual and Tactile Predictions for Enhanced Accuracy
cs.RO · 2023-04 · unverdicted · novelty 5.0

Visuo-tactile world models improve prediction accuracy in physically ambiguous robot-pushing scenarios, demonstrated on two new datasets with a magnetic tactile sensor.