hub Canonical reference

Video Diffusion Models

Jonathan Ho, Tim Salimans, Alexey Gritsenko, William Chan, Mohammad Norouzi, David J. Fleet · 2022 · cs.CV · arXiv 2204.03458

Canonical reference. 100% of citing Pith papers cite this work as background.

60 Pith papers citing it

Background 100% of classified citations

open full Pith review browse 60 citing papers arXiv PDF

abstract

Generating temporally coherent high fidelity video is an important milestone in generative modeling research. We make progress towards this milestone by proposing a diffusion model for video generation that shows very promising initial results. Our model is a natural extension of the standard image diffusion architecture, and it enables jointly training from image and video data, which we find to reduce the variance of minibatch gradients and speed up optimization. To generate long and higher resolution videos we introduce a new conditional sampling technique for spatial and temporal video extension that performs better than previously proposed methods. We present the first results on a large text-conditioned video generation task, as well as state-of-the-art results on established benchmarks for video prediction and unconditional video generation. Supplementary material is available at https://video-diffusion.github.io/

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 10

citation-polarity summary

background 10

representative citing papers

MusicLM: Generating Music From Text

cs.SD · 2023-01-26 · conditional · novelty 8.0

MusicLM produces coherent multi-minute 24 kHz music from text prompts using hierarchical sequence-to-sequence modeling and outperforms prior systems in quality and text adherence.

Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow

cs.LG · 2022-09-07 · unverdicted · novelty 8.0

Rectified flow learns straight-path neural ODEs for distribution transport, yielding efficient generative models and domain transfers that work well even with a single simulation step.

FadeMem: Distance-Aware Memory Consolidation for Autoregressive Video Diffusion

cs.CV · 2026-06-09 · unverdicted · novelty 7.0

FadeMem introduces distance-aware KV memory consolidation for autoregressive video diffusion that builds a temporal hierarchy with power-law merging to preserve short-term dynamics and long-range coherence under fixed cache budget.

Functionalization via Structure Completion and Motion Rectification

cs.CV · 2026-05-18 · unverdicted · novelty 7.0

Object functionalization is cast as neural graph completion over a functional graph of parts, contacts, and motions, followed by geometry realization that also rectifies erroneous motions, demonstrated on furniture with a new paired dataset.

StreamingEffect: Real-Time Human-Centric Video Effect Generation

cs.CV · 2026-05-16 · unverdicted · novelty 7.0

StreamingEffect enables real-time 720p human-centric video effect generation on one GPU via teacher-student distillation, keyframe control, and a new 130K video dataset.

$Z^2$-Sampling: Zero-Cost Zigzag Trajectories for Semantic Alignment in Diffusion Models

cs.CV · 2026-04-26 · unverdicted · novelty 7.0

Z²-Sampling implicitly realizes zero-cost zigzag trajectories for curvature-aware semantic alignment in diffusion models by reducing multi-step paths via operator dualities and temporal caching while synthesizing a directional derivative penalty.

AttentionBender: Manipulating Cross-Attention in Video Diffusion Transformers as a Creative Probe

cs.MM · 2026-04-22 · unverdicted · novelty 7.0

AttentionBender applies 2D transforms to cross-attention maps in video diffusion transformers, producing distributed distortions and glitch aesthetics that reveal entangled attention mechanisms while serving as both an XAI probe and creative tool.

Speculative Decoding for Autoregressive Video Generation

cs.CV · 2026-04-19 · conditional · novelty 7.0

A training-free speculative decoding method for block-based autoregressive video diffusion uses a quality router on worst-frame ImageReward scores to accept drafter proposals, achieving up to 2.09x speedup at 95.7% quality retention.

MoZoo:Unleashing Video Diffusion power in animal fur and muscle simulation

cs.GR · 2026-04-08 · unverdicted · novelty 7.0

MoZoo generates high-fidelity animal videos with fur and muscle dynamics from coarse meshes by extending video diffusion with role-aware RoPE and asymmetric decoupled attention, trained on a new synthetic-to-real dataset.

Score Shocks: The Burgers Equation Structure of Diffusion Generative Models

cond-mat.stat-mech · 2026-04-08 · unverdicted · novelty 7.0

The score in diffusion models obeys viscous Burgers dynamics, with binary mode boundaries producing a universal tanh interfacial profile whose sharpening marks speciation transitions.

Physics-Aware Video Instance Removal Benchmark

cs.CV · 2026-04-07 · unverdicted · novelty 7.0

The PVIR benchmark tests video object removal on physical consistency using 95 annotated videos and shows that existing methods struggle with complex interactions like lingering shadows.

Zero-Shot Robotic Manipulation with Pretrained Image-Editing Diffusion Models

cs.RO · 2023-10-16 · conditional · novelty 7.0

SuSIE uses a finetuned InstructPix2Pix diffusion model to propose subgoal images that guide a low-level goal-conditioned policy, achieving SOTA zero-shot performance on CALVIN and real-world manipulation.

Learning Interactive Real-World Simulators

cs.AI · 2023-10-09 · conditional · novelty 7.0

UniSim learns a universal real-world simulator from orchestrated diverse datasets, enabling zero-shot deployment of policies trained purely in simulation.

Phenaki: Variable Length Video Generation From Open Domain Textual Description

cs.CV · 2022-10-05 · unverdicted · novelty 7.0

Phenaki generates arbitrary-length videos from sequences of text prompts by tokenizing videos with causal temporal attention and generating tokens with a text-conditioned masked transformer, trained jointly on images and videos.

Imagen Video: High Definition Video Generation with Diffusion Models

cs.CV · 2022-10-05 · unverdicted · novelty 7.0

Imagen Video generates high-definition text-conditional videos via a cascade of base and super-resolution diffusion models, achieving high fidelity and controllability.

DreamFusion: Text-to-3D using 2D Diffusion

cs.CV · 2022-09-29 · accept · novelty 7.0

Optimizes a Neural Radiance Field via probability density distillation from a 2D diffusion model to produce text-conditioned 3D scenes viewable from any angle.

Human Motion Diffusion Model

cs.CV · 2022-09-29 · unverdicted · novelty 7.0

MDM is a classifier-free diffusion model that generates expressive human motions by predicting clean samples rather than noise, supporting text and action conditioning and outperforming prior methods on standard benchmarks.

Diffusion Posterior Sampling for General Noisy Inverse Problems

stat.ML · 2022-09-29 · unverdicted · novelty 7.0

Diffusion models solve noisy (non)linear inverse problems via approximated posterior sampling that blends diffusion steps with manifold gradients without strict consistency projection.

Pixel Cube: Diffusion-based Portrait Video Relighting Through Realistic Lighting Reproduction

cs.CV · 2026-06-01 · unverdicted · novelty 6.0

Diffusion model trained on hybrid real-rendered portrait videos uses HDR environment maps and background images for photorealistic, temporally consistent video relighting.

MORPHOS: Autoregressive 4D Generation with Temporal Structured Latents

cs.CV · 2026-06-01 · unverdicted · novelty 6.0

MORPHOS introduces an autoregressive 4D generation method with Temporal Structured Latents (T-SLAT) that produces dynamic 3D assets from videos while handling topological changes and long sequences.

BlockGen: Flexible Blockwise Sequence Modeling with Hybrid Samplers

cs.LG · 2026-06-01 · unverdicted · novelty 6.0

BlockGen enables flexible blockwise diffusion modeling with mixed block sizes and ARPC sampling, finding uniform diffusion outperforms masked under ancestral sampling in few-step regimes while the gap reverses with ARPC at high NFE.

A Systematic Study of Behavioral Cloning for Scientific Data Annotation

cs.HC · 2026-05-26 · unverdicted · novelty 6.0

Introduces 9 synthetic annotation tasks and benchmarks for behavioral cloning, finding hierarchical skill learning, scaling benefits, effective multi-task pretraining, and shared internal representations of task phases and mistakes.

Bridging Brain and Semantics: A Hierarchical Framework for Semantically Enhanced fMRI-to-Video Reconstruction

cs.CV · 2026-05-14 · unverdicted · novelty 6.0

CineNeuron improves fMRI-to-video reconstruction by combining bottom-up semantic enrichment with top-down Mixture-of-Memories integration and outperforms prior methods on benchmarks.

UniVidX: A Unified Multimodal Framework for Versatile Video Generation via Diffusion Priors

cs.CV · 2026-05-01 · unverdicted · novelty 6.0

UniVidX unifies diverse video generation tasks into one conditional diffusion model using stochastic condition masking, decoupled gated LoRAs, and cross-modal self-attention.

citing papers explorer

Showing 30 of 30 citing papers after filters.

FadeMem: Distance-Aware Memory Consolidation for Autoregressive Video Diffusion cs.CV · 2026-06-09 · unverdicted · none · ref 48 · internal anchor
FadeMem introduces distance-aware KV memory consolidation for autoregressive video diffusion that builds a temporal hierarchy with power-law merging to preserve short-term dynamics and long-range coherence under fixed cache budget.
Functionalization via Structure Completion and Motion Rectification cs.CV · 2026-05-18 · unverdicted · none · ref 118 · internal anchor
Object functionalization is cast as neural graph completion over a functional graph of parts, contacts, and motions, followed by geometry realization that also rectifies erroneous motions, demonstrated on furniture with a new paired dataset.
StreamingEffect: Real-Time Human-Centric Video Effect Generation cs.CV · 2026-05-16 · unverdicted · none · ref 23 · internal anchor
StreamingEffect enables real-time 720p human-centric video effect generation on one GPU via teacher-student distillation, keyframe control, and a new 130K video dataset.
$Z^2$-Sampling: Zero-Cost Zigzag Trajectories for Semantic Alignment in Diffusion Models cs.CV · 2026-04-26 · unverdicted · none · ref 11 · internal anchor
Z²-Sampling implicitly realizes zero-cost zigzag trajectories for curvature-aware semantic alignment in diffusion models by reducing multi-step paths via operator dualities and temporal caching while synthesizing a directional derivative penalty.
AttentionBender: Manipulating Cross-Attention in Video Diffusion Transformers as a Creative Probe cs.MM · 2026-04-22 · unverdicted · none · ref 33 · internal anchor
AttentionBender applies 2D transforms to cross-attention maps in video diffusion transformers, producing distributed distortions and glitch aesthetics that reveal entangled attention mechanisms while serving as both an XAI probe and creative tool.
Speculative Decoding for Autoregressive Video Generation cs.CV · 2026-04-19 · conditional · none · ref 5 · internal anchor
A training-free speculative decoding method for block-based autoregressive video diffusion uses a quality router on worst-frame ImageReward scores to accept drafter proposals, achieving up to 2.09x speedup at 95.7% quality retention.
MoZoo:Unleashing Video Diffusion power in animal fur and muscle simulation cs.GR · 2026-04-08 · unverdicted · none · ref 12 · internal anchor
MoZoo generates high-fidelity animal videos with fur and muscle dynamics from coarse meshes by extending video diffusion with role-aware RoPE and asymmetric decoupled attention, trained on a new synthetic-to-real dataset.
Score Shocks: The Burgers Equation Structure of Diffusion Generative Models cond-mat.stat-mech · 2026-04-08 · unverdicted · none · ref 2 · internal anchor
The score in diffusion models obeys viscous Burgers dynamics, with binary mode boundaries producing a universal tanh interfacial profile whose sharpening marks speciation transitions.
Physics-Aware Video Instance Removal Benchmark cs.CV · 2026-04-07 · unverdicted · none · ref 7 · internal anchor
The PVIR benchmark tests video object removal on physical consistency using 95 annotated videos and shows that existing methods struggle with complex interactions like lingering shadows.
Pixel Cube: Diffusion-based Portrait Video Relighting Through Realistic Lighting Reproduction cs.CV · 2026-06-01 · unverdicted · none · ref 12 · internal anchor
Diffusion model trained on hybrid real-rendered portrait videos uses HDR environment maps and background images for photorealistic, temporally consistent video relighting.
MORPHOS: Autoregressive 4D Generation with Temporal Structured Latents cs.CV · 2026-06-01 · unverdicted · none · ref 15 · internal anchor
MORPHOS introduces an autoregressive 4D generation method with Temporal Structured Latents (T-SLAT) that produces dynamic 3D assets from videos while handling topological changes and long sequences.
BlockGen: Flexible Blockwise Sequence Modeling with Hybrid Samplers cs.LG · 2026-06-01 · unverdicted · none · ref 127 · internal anchor
BlockGen enables flexible blockwise diffusion modeling with mixed block sizes and ARPC sampling, finding uniform diffusion outperforms masked under ancestral sampling in few-step regimes while the gap reverses with ARPC at high NFE.
A Systematic Study of Behavioral Cloning for Scientific Data Annotation cs.HC · 2026-05-26 · unverdicted · none · ref 266 · internal anchor
Introduces 9 synthetic annotation tasks and benchmarks for behavioral cloning, finding hierarchical skill learning, scaling benefits, effective multi-task pretraining, and shared internal representations of task phases and mistakes.
Bridging Brain and Semantics: A Hierarchical Framework for Semantically Enhanced fMRI-to-Video Reconstruction cs.CV · 2026-05-14 · unverdicted · none · ref 33 · internal anchor
CineNeuron improves fMRI-to-video reconstruction by combining bottom-up semantic enrichment with top-down Mixture-of-Memories integration and outperforms prior methods on benchmarks.
UniVidX: A Unified Multimodal Framework for Versatile Video Generation via Diffusion Priors cs.CV · 2026-05-01 · unverdicted · none · ref 82 · internal anchor
UniVidX unifies diverse video generation tasks into one conditional diffusion model using stochastic condition masking, decoupled gated LoRAs, and cross-modal self-attention.
DynamicRad: Content-Adaptive Sparse Attention for Long Video Diffusion cs.CV · 2026-04-22 · unverdicted · none · ref 6 · internal anchor
DynamicRad achieves 1.7x-2.5x inference speedups in long video diffusion with over 80% sparsity by grounding adaptive selection in a radial locality prior, using dual-mode static/dynamic strategies and offline BO with a semantic motion router.
Representations Before Pixels: Semantics-Guided Hierarchical Video Prediction cs.CV · 2026-04-13 · unverdicted · none · ref 30 · internal anchor
Re2Pix decomposes video prediction into semantic feature forecasting followed by representation-conditioned diffusion synthesis, with nested dropout and mixed supervision to handle prediction errors.
3D Scene-Adaptive Trajectory-Controllable Human Image Animation with Camera Movement cs.CV · 2026-06-29 · unverdicted · none · ref 9 · 2 links · internal anchor
Presents a scene-adaptive 3D human image animation framework using ground-adaptive motion retargeting and viewpoint-adaptive latent fusion to control human and camera trajectories, claiming improvements on two benchmarks.
BioVid: Autoregressive Video Generation with Biological Behavior Semantic Comprehension cs.CV · 2026-06-07 · unverdicted · none · ref 2 · internal anchor
BioVid is a data-driven autoregressive model using 2D-encode/3D-decode tokenization and causal Transformer with EOS termination that reproduces real action duration distributions (W1 distance 1.24 frames) on NTU RGB+D drinking clips, outperforming fixed-length baselines.
Improving Visual Representation Alignment Generation with GRPO cs.CV · 2026-05-30 · unverdicted · none · ref 10 · internal anchor
VRPO applies generative representation policy optimization to dynamically align diffusion features with pretrained visual encoders, claiming +1.8 FID gains and 2.3x faster training versus REPA.
DisagFusion: Asynchronous Pipeline Parallelism and Elastic Scheduling for Disaggregated Diffusion Serving cs.DC · 2026-05-25 · unverdicted · none · ref 8 · internal anchor
DisagFusion achieves 3.4x-20.5x higher throughput and 18.5x lower latency for diffusion serving via asynchronous pipeline parallelism and elastic hybrid scheduling on disaggregated hardware.
{\Phi}-Noise: Training-Free Temporal Video Conditioning via Phase-Based Noise Manipulation cs.CV · 2026-05-23 · unverdicted · none · ref 23 · internal anchor
Training-free motion conditioning for latent video diffusion by direct injection of low-frequency phase from a reference video into the diffusion noise.
DriveCtrl: Conditioned Sim-to-Real Driving Video Generation cs.CV · 2026-05-14 · unverdicted · none · ref 20 · internal anchor
DriveCtrl is a depth-conditioned controllable framework that generates realistic driving videos from simulation while preserving annotations and scene dynamics.
World Action Models: The Next Frontier in Embodied AI cs.RO · 2026-05-12 · unverdicted · none · ref 290 · internal anchor
The paper introduces World Action Models as a new paradigm unifying predictive world modeling with action generation in embodied foundation models and provides a taxonomy of existing approaches.
Watching Physics: the Generative Science of Matter and Motion cs.CE · 2026-04-18 · unverdicted · none · ref 15 · internal anchor
Generative video models recover physical quantities like surface strain from visible motion when coupled with experiments and simulations, but fail when internal variables dominate, defining a new Generative Science of Matter and Motion.
Discrete Meanflow Training Curriculum cs.LG · 2026-04-10 · unverdicted · none · ref 7 · internal anchor
A DMF curriculum initialized from pretrained flow models achieves one-step FID 3.36 on CIFAR-10 after only 2000 epochs by exploiting a discretized consistency property in the Meanflow objective.
EchoTorrent: Towards Swift, Sustained, and Streaming Multi-Modal Video Generation cs.CV · 2026-02-14 · unverdicted · none · ref 38 · internal anchor
EchoTorrent combines multi-teacher distillation, adaptive CFG calibration, hybrid long-tail forcing, and VAE decoder refinement to enable few-pass autoregressive streaming video generation with improved temporal consistency and audio-lip sync.
Collaborative Few-Step Distillation and Low-Bit Quantization for Wan2.2 Dual-Expert Video Diffusion Models cs.CV · 2026-05-30 · unverdicted · none · ref 3 · internal anchor
A co-designed few-step distillation and low-bit quantization pipeline for Wan2.2-T2V-A14B keeps quantized few-step performance close to or above the full-precision baseline at 8 and 20 steps.
Introduction to Stochastic Differential Equations for Generative Machine Learning: A Variational Perspective cs.LG · 2026-06-30 · unverdicted · none · ref 7 · internal anchor
An expository tutorial deriving the ELBO for SDE-based generative models and presenting diffusion, score, and flow matching as variational parameterizations illustrated on a 1D example.
Physically Native World Models: A Hamiltonian Perspective on Generative World Modeling cs.AI · 2026-05-01 · unreviewed · ref 10 · 2 links · internal anchor

Video Diffusion Models

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer