MusicLM produces coherent multi-minute 24 kHz music from text prompts using hierarchical sequence-to-sequence modeling and outperforms prior systems in quality and text adherence.
hub Canonical reference
Video Diffusion Models
Canonical reference. 100% of citing Pith papers cite this work as background.
abstract
Generating temporally coherent high fidelity video is an important milestone in generative modeling research. We make progress towards this milestone by proposing a diffusion model for video generation that shows very promising initial results. Our model is a natural extension of the standard image diffusion architecture, and it enables jointly training from image and video data, which we find to reduce the variance of minibatch gradients and speed up optimization. To generate long and higher resolution videos we introduce a new conditional sampling technique for spatial and temporal video extension that performs better than previously proposed methods. We present the first results on a large text-conditioned video generation task, as well as state-of-the-art results on established benchmarks for video prediction and unconditional video generation. Supplementary material is available at https://video-diffusion.github.io/
hub tools
citation-role summary
citation-polarity summary
fields
cs.CV 31 cs.RO 5 cs.LG 3 cs.AI 2 cond-mat.stat-mech 1 cs.CE 1 cs.GR 1 cs.MM 1 cs.SD 1 stat.ML 1roles
background 10polarities
background 10representative citing papers
Rectified flow learns straight-path neural ODEs for distribution transport, yielding efficient generative models and domain transfers that work well even with a single simulation step.
Object functionalization is cast as neural graph completion over a functional graph of parts, contacts, and motions, followed by geometry realization that also rectifies erroneous motions, demonstrated on furniture with a new paired dataset.
StreamingEffect enables real-time 720p human-centric video effect generation on one GPU via teacher-student distillation, keyframe control, and a new 130K video dataset.
Z²-Sampling implicitly realizes zero-cost zigzag trajectories for curvature-aware semantic alignment in diffusion models by reducing multi-step paths via operator dualities and temporal caching while synthesizing a directional derivative penalty.
AttentionBender applies 2D transforms to cross-attention maps in video diffusion transformers, producing distributed distortions and glitch aesthetics that reveal entangled attention mechanisms while serving as both an XAI probe and creative tool.
A training-free speculative decoding method for block-based autoregressive video diffusion uses a quality router on worst-frame ImageReward scores to accept drafter proposals, achieving up to 2.09x speedup at 95.7% quality retention.
MoZoo generates high-fidelity animal videos with fur and muscle dynamics from coarse meshes by extending video diffusion with role-aware RoPE and asymmetric decoupled attention, trained on a new synthetic-to-real dataset.
The score in diffusion models obeys viscous Burgers dynamics, with binary mode boundaries producing a universal tanh interfacial profile whose sharpening marks speciation transitions.
The PVIR benchmark tests video object removal on physical consistency using 95 annotated videos and shows that existing methods struggle with complex interactions like lingering shadows.
SuSIE uses a finetuned InstructPix2Pix diffusion model to propose subgoal images that guide a low-level goal-conditioned policy, achieving SOTA zero-shot performance on CALVIN and real-world manipulation.
UniSim learns a universal real-world simulator from orchestrated diverse datasets, enabling zero-shot deployment of policies trained purely in simulation.
Phenaki generates arbitrary-length videos from sequences of text prompts by tokenizing videos with causal temporal attention and generating tokens with a text-conditioned masked transformer, trained jointly on images and videos.
Imagen Video generates high-definition text-conditional videos via a cascade of base and super-resolution diffusion models, achieving high fidelity and controllability.
Optimizes a Neural Radiance Field via probability density distillation from a 2D diffusion model to produce text-conditioned 3D scenes viewable from any angle.
MDM is a classifier-free diffusion model that generates expressive human motions by predicting clean samples rather than noise, supporting text and action conditioning and outperforming prior methods on standard benchmarks.
Diffusion models solve noisy (non)linear inverse problems via approximated posterior sampling that blends diffusion steps with manifold gradients without strict consistency projection.
CineNeuron improves fMRI-to-video reconstruction by combining bottom-up semantic enrichment with top-down Mixture-of-Memories integration and outperforms prior methods on benchmarks.
UniVidX unifies diverse video generation tasks into one conditional diffusion model using stochastic condition masking, decoupled gated LoRAs, and cross-modal self-attention.
DynamicRad achieves 1.7x-2.5x inference speedups in long video diffusion with over 80% sparsity by grounding adaptive selection in a radial locality prior, using dual-mode static/dynamic strategies and offline BO with a semantic motion router.
Re2Pix decomposes video prediction into semantic feature forecasting followed by representation-conditioned diffusion synthesis, with nested dropout and mixed supervision to handle prediction errors.
A factorized modular diffusion policy improves fitting of multimodal robot actions and enables flexible task adaptation without catastrophic forgetting.
Flow Marching jointly samples noise and physical time to learn a velocity field for generative PDE modeling, paired with a latent autoencoder and efficient transformer for large-scale pretraining on 2.5M trajectories.
MAGI-1 is a 24B-parameter autoregressive video world model that predicts denoised frame chunks sequentially with increasing noise to enable causal, scalable, streaming generation up to 4M token contexts.
citing papers explorer
-
MusicLM: Generating Music From Text
MusicLM produces coherent multi-minute 24 kHz music from text prompts using hierarchical sequence-to-sequence modeling and outperforms prior systems in quality and text adherence.
-
Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow
Rectified flow learns straight-path neural ODEs for distribution transport, yielding efficient generative models and domain transfers that work well even with a single simulation step.
-
Functionalization via Structure Completion and Motion Rectification
Object functionalization is cast as neural graph completion over a functional graph of parts, contacts, and motions, followed by geometry realization that also rectifies erroneous motions, demonstrated on furniture with a new paired dataset.
-
StreamingEffect: Real-Time Human-Centric Video Effect Generation
StreamingEffect enables real-time 720p human-centric video effect generation on one GPU via teacher-student distillation, keyframe control, and a new 130K video dataset.
-
$Z^2$-Sampling: Zero-Cost Zigzag Trajectories for Semantic Alignment in Diffusion Models
Z²-Sampling implicitly realizes zero-cost zigzag trajectories for curvature-aware semantic alignment in diffusion models by reducing multi-step paths via operator dualities and temporal caching while synthesizing a directional derivative penalty.
-
AttentionBender: Manipulating Cross-Attention in Video Diffusion Transformers as a Creative Probe
AttentionBender applies 2D transforms to cross-attention maps in video diffusion transformers, producing distributed distortions and glitch aesthetics that reveal entangled attention mechanisms while serving as both an XAI probe and creative tool.
-
Speculative Decoding for Autoregressive Video Generation
A training-free speculative decoding method for block-based autoregressive video diffusion uses a quality router on worst-frame ImageReward scores to accept drafter proposals, achieving up to 2.09x speedup at 95.7% quality retention.
-
MoZoo:Unleashing Video Diffusion power in animal fur and muscle simulation
MoZoo generates high-fidelity animal videos with fur and muscle dynamics from coarse meshes by extending video diffusion with role-aware RoPE and asymmetric decoupled attention, trained on a new synthetic-to-real dataset.
-
Score Shocks: The Burgers Equation Structure of Diffusion Generative Models
The score in diffusion models obeys viscous Burgers dynamics, with binary mode boundaries producing a universal tanh interfacial profile whose sharpening marks speciation transitions.
-
Physics-Aware Video Instance Removal Benchmark
The PVIR benchmark tests video object removal on physical consistency using 95 annotated videos and shows that existing methods struggle with complex interactions like lingering shadows.
-
Zero-Shot Robotic Manipulation with Pretrained Image-Editing Diffusion Models
SuSIE uses a finetuned InstructPix2Pix diffusion model to propose subgoal images that guide a low-level goal-conditioned policy, achieving SOTA zero-shot performance on CALVIN and real-world manipulation.
-
Learning Interactive Real-World Simulators
UniSim learns a universal real-world simulator from orchestrated diverse datasets, enabling zero-shot deployment of policies trained purely in simulation.
-
Phenaki: Variable Length Video Generation From Open Domain Textual Description
Phenaki generates arbitrary-length videos from sequences of text prompts by tokenizing videos with causal temporal attention and generating tokens with a text-conditioned masked transformer, trained jointly on images and videos.
-
Imagen Video: High Definition Video Generation with Diffusion Models
Imagen Video generates high-definition text-conditional videos via a cascade of base and super-resolution diffusion models, achieving high fidelity and controllability.
-
DreamFusion: Text-to-3D using 2D Diffusion
Optimizes a Neural Radiance Field via probability density distillation from a 2D diffusion model to produce text-conditioned 3D scenes viewable from any angle.
-
Human Motion Diffusion Model
MDM is a classifier-free diffusion model that generates expressive human motions by predicting clean samples rather than noise, supporting text and action conditioning and outperforming prior methods on standard benchmarks.
-
Diffusion Posterior Sampling for General Noisy Inverse Problems
Diffusion models solve noisy (non)linear inverse problems via approximated posterior sampling that blends diffusion steps with manifold gradients without strict consistency projection.
-
Bridging Brain and Semantics: A Hierarchical Framework for Semantically Enhanced fMRI-to-Video Reconstruction
CineNeuron improves fMRI-to-video reconstruction by combining bottom-up semantic enrichment with top-down Mixture-of-Memories integration and outperforms prior methods on benchmarks.
-
UniVidX: A Unified Multimodal Framework for Versatile Video Generation via Diffusion Priors
UniVidX unifies diverse video generation tasks into one conditional diffusion model using stochastic condition masking, decoupled gated LoRAs, and cross-modal self-attention.
-
DynamicRad: Content-Adaptive Sparse Attention for Long Video Diffusion
DynamicRad achieves 1.7x-2.5x inference speedups in long video diffusion with over 80% sparsity by grounding adaptive selection in a radial locality prior, using dual-mode static/dynamic strategies and offline BO with a semantic motion router.
-
Representations Before Pixels: Semantics-Guided Hierarchical Video Prediction
Re2Pix decomposes video prediction into semantic feature forecasting followed by representation-conditioned diffusion synthesis, with nested dropout and mixed supervision to handle prediction errors.
-
Flexible Multitask Learning with Factorized Diffusion Policy
A factorized modular diffusion policy improves fitting of multimodal robot actions and enables flexible task adaptation without catastrophic forgetting.
-
Flow marching for a generative PDE foundation model
Flow Marching jointly samples noise and physical time to learn a velocity field for generative PDE modeling, paired with a latent autoencoder and efficient transformer for large-scale pretraining on 2.5M trajectories.
-
MAGI-1: Autoregressive Video Generation at Scale
MAGI-1 is a 24B-parameter autoregressive video world model that predicts denoised frame chunks sequentially with increasing noise to enable causal, scalable, streaming generation up to 4M token contexts.
-
We'll Fix it in Post: Improving Text-to-Video Generation with Neuro-Symbolic Feedback
NeuS-E is a post-generation refinement method that uses neuro-symbolic analysis of a formal video representation to detect and correct semantic and temporal inconsistencies in text-to-video outputs, improving prompt alignment by nearly 40%.
-
Unified World Models: Coupling Video and Action Diffusion for Pretraining on Large Robotic Datasets
Unified World Models couple video and action diffusion inside one transformer with independent timesteps, enabling pretraining on heterogeneous robot datasets that include action-free video and producing more generalizable policies than imitation learning alone.
-
DOLLAR: Few-Step Video Generation via Distillation and Latent Reward Optimization
DOLLAR combines variational score and consistency distillation for few-step video generation plus latent reward optimization, reporting 82.57 VBench score and up to 278x speedup over the teacher diffusion model for 128-frame 10-second videos.
-
CAT3D: Create Anything in 3D with Multi-View Diffusion Models
A multi-view diffusion model generates consistent novel views from sparse images to enable fast 3D scene reconstruction.
-
CameraCtrl: Enabling Camera Control for Text-to-Video Generation
CameraCtrl enables accurate camera pose control in video diffusion models through a trained plug-and-play module and dataset choices emphasizing diverse camera trajectories with matching appearance.
-
Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets
Stable Video Diffusion scales latent video diffusion models via text-to-image pretraining, video pretraining on curated data, and high-quality finetuning to produce competitive text-to-video and image-to-video results while enabling motion LoRA and multi-view 3D applications.
-
Improved DDIM Sampling with Moment Matching Gaussian Mixtures
Moment-matched GMM kernels in DDIM yield lower FID and higher IS than Gaussian kernels at small sampling steps on CelebA-HQ, FFHQ, ImageNet, and Stable Diffusion tasks.
-
Shap-E: Generating Conditional 3D Implicit Functions
Shap-E encodes 3D assets into implicit function parameters then uses a conditional diffusion model to generate new ones from text, enabling fast multi-representation 3D asset creation.
-
Latent Video Diffusion Models for High-Fidelity Long Video Generation
Latent-space hierarchical diffusion models with targeted error-correction techniques generate realistic videos exceeding 1000 frames while using less compute than prior pixel-space approaches.
-
MagicVideo: Efficient Video Generation With Latent Diffusion Models
MagicVideo generates 256x256 text-conditioned video clips via latent diffusion with a custom 3D U-Net, achieving roughly 64 times lower compute than prior video diffusion models.
-
Make-A-Video: Text-to-Video Generation without Text-Video Data
Make-A-Video achieves state-of-the-art text-to-video generation by decomposing temporal U-Net and attention structures to add space-time modeling to text-to-image models, trained without any paired text-video data.
-
BADiff: Bandwidth Adaptive Diffusion Model
BADiff introduces joint training of diffusion models with quality conditioning derived from bandwidth to enable adaptive early-stop sampling that preserves appropriate perceptual quality.
-
DriVerse: Navigation World Model for Driving Simulation via Multimodal Trajectory Prompting and Motion Alignment
DriVerse is a generative model that simulates driving scenes from an image and trajectory using multimodal prompting and motion alignment, achieving better performance on nuScenes and Waymo datasets with minimal training.
-
Wan: Open and Advanced Large-Scale Video Generative Models
Wan releases open 1.3B and 14B video diffusion models claiming superior performance over open-source and commercial baselines across multiple tasks with consumer-grade efficiency.
-
I2VGen-XL: High-Quality Image-to-Video Synthesis via Cascaded Diffusion Models
I2VGen-XL applies cascaded diffusion models with a base stage for semantic preservation via hierarchical encoders and a refinement stage for detail and resolution, trained on 35 million text-video and 6 billion text-image pairs.
-
CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers
CogVideo is a large-scale transformer pretrained for text-to-video generation that outperforms public models in evaluations.
-
World Action Models: The Next Frontier in Embodied AI
The paper introduces World Action Models as a new paradigm unifying predictive world modeling with action generation in embodied foundation models and provides a taxonomy of existing approaches.
-
Watching Physics: the Generative Science of Matter and Motion
Generative video models recover physical quantities like surface strain from visible motion when coupled with experiments and simulations, but fail when internal variables dominate, defining a new Generative Science of Matter and Motion.
-
Discrete Meanflow Training Curriculum
A DMF curriculum initialized from pretrained flow models achieves one-step FID 3.36 on CIFAR-10 after only 2000 epochs by exploiting a discretized consistency property in the Meanflow objective.
-
EchoTorrent: Towards Swift, Sustained, and Streaming Multi-Modal Video Generation
EchoTorrent combines multi-teacher distillation, adaptive CFG calibration, hybrid long-tail forcing, and VAE decoder refinement to enable few-pass autoregressive streaming video generation with improved temporal consistency and audio-lip sync.
-
ModelScope Text-to-Video Technical Report
ModelScopeT2V is a 1.7-billion-parameter text-to-video model built on Stable Diffusion that adds temporal modeling and outperforms prior methods on three evaluation metrics.
- Physically Native World Models: A Hamiltonian Perspective on Generative World Modeling
- VRAG: Learning World Models for Interactive Video Generation