Stochastic Variational Video Prediction

Chelsea Finn; Dumitru Erhan; Mohammad Babaeizadeh; Roy H. Campbell; Sergey Levine

arxiv: 1710.11252 · v2 · pith:VPQAMYOBnew · submitted 2017-10-30 · 💻 cs.CV · cs.RO

Mohammad Babaeizadeh, Chelsea Finn, Dumitru Erhan, Roy H. Campbell, Sergey Levine

keywords: stochastic, real-world, video, future, prediction, method, model, images

Predicting the future in real-world settings, particularly from raw sensory observations such as images, is exceptionally challenging. Real-world events can be stochastic and unpredictable, and the high dimensionality and complexity of natural images require the predictive model to build an intricate understanding of the natural world. Many existing methods tackle this problem by making simplifying assumptions about the environment. One common assumption is that the outcome is deterministic and there is only one plausible future. This can lead to low-quality predictions in real-world settings with stochastic dynamics. In this paper, we develop a stochastic variational video prediction (SV2P) method that predicts a different possible future for each sample of its latent variables. To the best of our knowledge, our model is the first to provide effective stochastic multi-frame prediction for real-world video. We demonstrate the capability of the proposed method in predicting detailed future frames of videos on multiple real-world datasets, both action-free and action-conditioned. We find that our proposed method produces substantially improved video predictions when compared to the same model without stochasticity, and to other stochastic video prediction methods. Our SV2P implementation will be open-sourced upon publication.

Forward citations

Cited by 11 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Point Tracking Improves World Action Models
cs.RO 2026-05 unverdicted novelty 7.0

JOPAT jointly models pixels, point tracks, and actions in a diffusion transformer and reports gains over pixel-only baselines on long-horizon robot tasks with occlusion and off-screen motion.
VAnim: Rendering-Aware Sparse State Modeling for Structure-Preserving Vector Animation
cs.CV 2026-05 unverdicted novelty 7.0

VAnim creates open-domain text-to-SVG animations via sparse state updates on a persistent DOM tree, identification-first planning, and rendering-aware RL with a new 134k-example benchmark.
Imagen Video: High Definition Video Generation with Diffusion Models
cs.CV 2022-10 unverdicted novelty 7.0

Imagen Video generates high-definition text-conditional videos via a cascade of base and super-resolution diffusion models, achieving high fidelity and controllability.
Video Diffusion Models
cs.CV 2022-04 unverdicted novelty 7.0

A diffusion model for video generation extends image architectures with joint image-video training and improved conditional sampling, delivering first large-scale text-to-video results and state-of-the-art performance...
Mastering Atari with Discrete World Models
cs.LG 2020-10 accept novelty 7.0

DreamerV2 reaches human-level performance on 55 Atari games by learning behaviors inside a separately trained discrete-latent world model.
DragNUWA: Fine-grained Control in Video Generation by Integrating Text, Image, and Trajectory
cs.CV 2023-08 unverdicted novelty 6.0

DragNUWA integrates text, image, and trajectory controls into a diffusion video model using a Trajectory Sampler, Multiscale Fusion, and Adaptive Training to enable fine-grained open-domain video generation.
VideoGPT: Video Generation using VQ-VAE and Transformers
cs.CV 2021-04 accept novelty 6.0

VideoGPT generates competitive natural videos by learning discrete latents with VQ-VAE and modeling them autoregressively with a transformer.
Order Matters: Shuffling Sequence Generation for Video Prediction
cs.CV 2019-07 unverdicted novelty 6.0

SEE-Net improves video prediction by using frame shuffling to enforce learning of natural temporal order, reporting state-of-the-art results on three synthetic and real-world datasets.
Multi-Modal World Model for Physical Robot Interactions: Simultaneous Visual and Tactile Predictions for Enhanced Accuracy
cs.RO 2023-04 unverdicted novelty 5.0

Visuo-tactile world models improve prediction accuracy in physically ambiguous robot-pushing scenarios, demonstrated on two new datasets with a magnetic tactile sensor.
World Action Models: The Next Frontier in Embodied AI
cs.RO 2026-05 unverdicted novelty 4.0

The paper introduces World Action Models as a new paradigm unifying predictive world modeling with action generation in embodied foundation models and provides a taxonomy of existing approaches.
Frame forecasting in cine MRI using the PCA respiratory motion model: comparing recurrent neural networks trained online and transformers
eess.IV 2024-10 unverdicted novelty 4.0

Online RNNs (RTRL, SnAp-1) beat linear filters and transformers at medium-to-long horizon forecasting of PCA respiratory motion weights in two cine-MRI datasets, yielding sub-1.4 mm and sub-2.8 mm geometric errors.