Flexible Diffusion Modeling of Long Videos

Christian Weilbach; Frank Wood; Saeid Naderiparizi; Vaden Masrani; William Harvey

arxiv: 2205.11495 · v3 · pith:N6B6DBPVnew · submitted 2022-05-23 · 💻 cs.CV · cs.LG

William Harvey, Saeid Naderiparizi, Vaden Masrani, Christian Weilbach, Frank Wood

classification 💻 cs.CV cs.LG

keywords video, modeling, frames, videos, diffusion, long, sample, sampled

We present a framework for video modeling based on denoising diffusion probabilistic models that produces long-duration video completions in a variety of realistic environments. We introduce a generative model that can at test-time sample any arbitrary subset of video frames conditioned on any other subset and present an architecture adapted for this purpose. Doing so allows us to efficiently compare and optimize a variety of schedules for the order in which frames in a long video are sampled and use selective sparse and long-range conditioning on previously sampled frames. We demonstrate improved video modeling over prior work on a number of datasets and sample temporally coherent videos over 25 minutes in length. We additionally release a new video modeling dataset and semantically meaningful metrics based on videos generated in the CARLA autonomous driving simulator.
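
The abstract describes sampling a long video in stages, where each stage generates some subset of frames conditioned on another subset, and the order of those stages is a schedule that can be compared and optimized. As a minimal sketch of what one such schedule might look like, the Python below builds a simple autoregressive schedule as a list of (frames to sample, frames to condition on) pairs. The function name `autoregressive_schedule` and its parameters are hypothetical and are not taken from the paper or its code. The sketch only plans frame indices and does not run a diffusion model.

```python
from typing import List, Tuple

Step = Tuple[List[int], List[int]]  # (frames to sample, frames to condition on)


def autoregressive_schedule(
    num_frames: int,    # total length of the video, in frames
    num_given: int,     # frames [0, num_given) are observed at the start
    window: int,        # assumed max frames the model sees per stage
    new_per_step: int,  # frames sampled per stage
) -> List[Step]:
    """Plan a left-to-right sampling order (hypothetical helper).

    Each stage samples the next `new_per_step` frames and conditions on the
    most recent frames that still fit in the window.
    """
    if not 0 < new_per_step < window:
        raise ValueError("need 0 < new_per_step < window")
    if not 0 < num_given <= num_frames:
        raise ValueError("need 0 < num_given <= num_frames")

    steps: List[Step] = []
    done = num_given  # frames [0, done) are observed or already sampled
    while done < num_frames:
        latent = list(range(done, min(done + new_per_step, num_frames)))
        n_cond = min(window - len(latent), done)
        cond = list(range(done - n_cond, done))
        steps.append((latent, cond))
        done += len(latent)
    return steps


if __name__ == "__main__":
    for latent, cond in autoregressive_schedule(10, 2, 4, 2):
        print("sample", latent, "given", cond)
```

With 10 frames, 2 observed frames, a window of 4 and 2 new frames per stage, the first stage samples frames 2 and 3 given frames 0 and 1, and each later stage slides forward by two frames. This sketch conditions only on the nearest preceding frames. The sparse and long-range conditioning mentioned in the abstract would correspond to choosing the conditioning indices differently.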

Forward citations

Cited by 12 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow
cs.LG 2022-09 unverdicted novelty 8.0

Rectified flow learns straight-path neural ODEs for distribution transport, yielding efficient generative models and domain transfers that work well even with a single simulation step.

Phenaki: Variable Length Video Generation From Open Domain Textual Description
cs.CV 2022-10 unverdicted novelty 7.0

Phenaki generates arbitrary-length videos from sequences of text prompts by tokenizing videos with causal temporal attention and generating tokens with a text-conditioned masked transformer, trained jointly on images ...

Imagen Video: High Definition Video Generation with Diffusion Models
cs.CV 2022-10 unverdicted novelty 7.0

Imagen Video generates high-definition text-conditional videos via a cascade of base and super-resolution diffusion models, achieving high fidelity and controllability.

Feed-forward Motion In-betweening for Any 4D
cs.CV 2026-06 unverdicted novelty 6.0

Proposes a feed-forward keyframe-conditioned in-betweening method for arbitrary 4D meshes using a topology-agnostic VAE and MMDiT-based rectified flow model.

DOLLAR: Few-Step Video Generation via Distillation and Latent Reward Optimization
cs.CV 2024-12 unverdicted novelty 6.0

DOLLAR combines variational score and consistency distillation for few-step video generation plus latent reward optimization, reporting 82.57 VBench score and up to 278x speedup over the teacher diffusion model for 12...

DragNUWA: Fine-grained Control in Video Generation by Integrating Text, Image, and Trajectory
cs.CV 2023-08 unverdicted novelty 6.0

DragNUWA integrates text, image, and trajectory controls into a diffusion video model using a Trajectory Sampler, Multiscale Fusion, and Adaptive Training to enable fine-grained open-domain video generation.

MagicVideo: Efficient Video Generation With Latent Diffusion Models
cs.CV 2022-11 unverdicted novelty 6.0

MagicVideo generates 256x256 text-conditioned video clips via latent diffusion with a custom 3D U-Net, achieving roughly 64 times lower compute than prior video diffusion models.

eDiff-I: Text-to-Image Diffusion Models with an Ensemble of Expert Denoisers
cs.CV 2022-11 unverdicted novelty 6.0

An ensemble of stage-specialized text-to-image diffusion models improves prompt alignment over single shared-parameter models while preserving visual quality and inference speed.

Probabilistic Precipitation Nowcasting with Rectified Flow Transformers
cs.CV 2026-05 unverdicted novelty 5.0

FREUD applies rectified flow transformers with frame-wise encoding and a unified decoder to achieve state-of-the-art probabilistic precipitation nowcasting on the SEVIR benchmark.

VRAG: Learning World Models for Interactive Video Generation
cs.CV 2025-05 unverdicted novelty 5.0

The work introduces video retrieval augmented generation (VRAG) with explicit global state conditioning to reduce compounding errors and improve spatiotemporal consistency in interactive video world models.

I2VGen-XL: High-Quality Image-to-Video Synthesis via Cascaded Diffusion Models
cs.CV 2023-11 unverdicted novelty 5.0

I2VGen-XL applies cascaded diffusion models with a base stage for semantic preservation via hierarchical encoders and a refinement stage for detail and resolution, trained on 35 million text-video and 6 billion text-i...

ModelScope Text-to-Video Technical Report
cs.CV 2023-08 unverdicted novelty 4.0

ModelScopeT2V is a 1.7-billion-parameter text-to-video model built on Stable Diffusion that adds temporal modeling and outperforms prior methods on three evaluation metrics.