MusicLM produces coherent multi-minute 24 kHz music from text prompts using hierarchical sequence-to-sequence modeling and outperforms prior systems in quality and text adherence.
hub Canonical reference
Video Diffusion Models
Canonical reference. 100% of citing Pith papers cite this work as background.
abstract
Generating temporally coherent high fidelity video is an important milestone in generative modeling research. We make progress towards this milestone by proposing a diffusion model for video generation that shows very promising initial results. Our model is a natural extension of the standard image diffusion architecture, and it enables jointly training from image and video data, which we find to reduce the variance of minibatch gradients and speed up optimization. To generate long and higher resolution videos we introduce a new conditional sampling technique for spatial and temporal video extension that performs better than previously proposed methods. We present the first results on a large text-conditioned video generation task, as well as state-of-the-art results on established benchmarks for video prediction and unconditional video generation. Supplementary material is available at https://video-diffusion.github.io/
hub tools
citation-role summary
citation-polarity summary
fields
cs.CV 40 cs.LG 5 cs.RO 5 cs.AI 2 cond-mat.stat-mech 1 cs.CE 1 cs.DC 1 cs.GR 1 cs.HC 1 cs.MM 1roles
background 10polarities
background 10representative citing papers
Rectified flow learns straight-path neural ODEs for distribution transport, yielding efficient generative models and domain transfers that work well even with a single simulation step.
FadeMem introduces distance-aware KV memory consolidation for autoregressive video diffusion that builds a temporal hierarchy with power-law merging to preserve short-term dynamics and long-range coherence under fixed cache budget.
Object functionalization is cast as neural graph completion over a functional graph of parts, contacts, and motions, followed by geometry realization that also rectifies erroneous motions, demonstrated on furniture with a new paired dataset.
StreamingEffect enables real-time 720p human-centric video effect generation on one GPU via teacher-student distillation, keyframe control, and a new 130K video dataset.
Z²-Sampling implicitly realizes zero-cost zigzag trajectories for curvature-aware semantic alignment in diffusion models by reducing multi-step paths via operator dualities and temporal caching while synthesizing a directional derivative penalty.
AttentionBender applies 2D transforms to cross-attention maps in video diffusion transformers, producing distributed distortions and glitch aesthetics that reveal entangled attention mechanisms while serving as both an XAI probe and creative tool.
A training-free speculative decoding method for block-based autoregressive video diffusion uses a quality router on worst-frame ImageReward scores to accept drafter proposals, achieving up to 2.09x speedup at 95.7% quality retention.
MoZoo generates high-fidelity animal videos with fur and muscle dynamics from coarse meshes by extending video diffusion with role-aware RoPE and asymmetric decoupled attention, trained on a new synthetic-to-real dataset.
The score in diffusion models obeys viscous Burgers dynamics, with binary mode boundaries producing a universal tanh interfacial profile whose sharpening marks speciation transitions.
The PVIR benchmark tests video object removal on physical consistency using 95 annotated videos and shows that existing methods struggle with complex interactions like lingering shadows.
SuSIE uses a finetuned InstructPix2Pix diffusion model to propose subgoal images that guide a low-level goal-conditioned policy, achieving SOTA zero-shot performance on CALVIN and real-world manipulation.
UniSim learns a universal real-world simulator from orchestrated diverse datasets, enabling zero-shot deployment of policies trained purely in simulation.
Phenaki generates arbitrary-length videos from sequences of text prompts by tokenizing videos with causal temporal attention and generating tokens with a text-conditioned masked transformer, trained jointly on images and videos.
Imagen Video generates high-definition text-conditional videos via a cascade of base and super-resolution diffusion models, achieving high fidelity and controllability.
Optimizes a Neural Radiance Field via probability density distillation from a 2D diffusion model to produce text-conditioned 3D scenes viewable from any angle.
MDM is a classifier-free diffusion model that generates expressive human motions by predicting clean samples rather than noise, supporting text and action conditioning and outperforming prior methods on standard benchmarks.
Diffusion models solve noisy (non)linear inverse problems via approximated posterior sampling that blends diffusion steps with manifold gradients without strict consistency projection.
Diffusion model trained on hybrid real-rendered portrait videos uses HDR environment maps and background images for photorealistic, temporally consistent video relighting.
MORPHOS introduces an autoregressive 4D generation method with Temporal Structured Latents (T-SLAT) that produces dynamic 3D assets from videos while handling topological changes and long sequences.
BlockGen enables flexible blockwise diffusion modeling with mixed block sizes and ARPC sampling, finding uniform diffusion outperforms masked under ancestral sampling in few-step regimes while the gap reverses with ARPC at high NFE.
Introduces 9 synthetic annotation tasks and benchmarks for behavioral cloning, finding hierarchical skill learning, scaling benefits, effective multi-task pretraining, and shared internal representations of task phases and mistakes.
CineNeuron improves fMRI-to-video reconstruction by combining bottom-up semantic enrichment with top-down Mixture-of-Memories integration and outperforms prior methods on benchmarks.
UniVidX unifies diverse video generation tasks into one conditional diffusion model using stochastic condition masking, decoupled gated LoRAs, and cross-modal self-attention.
citing papers explorer
-
FadeMem: Distance-Aware Memory Consolidation for Autoregressive Video Diffusion
FadeMem introduces distance-aware KV memory consolidation for autoregressive video diffusion that builds a temporal hierarchy with power-law merging to preserve short-term dynamics and long-range coherence under fixed cache budget.
-
Functionalization via Structure Completion and Motion Rectification
Object functionalization is cast as neural graph completion over a functional graph of parts, contacts, and motions, followed by geometry realization that also rectifies erroneous motions, demonstrated on furniture with a new paired dataset.
-
StreamingEffect: Real-Time Human-Centric Video Effect Generation
StreamingEffect enables real-time 720p human-centric video effect generation on one GPU via teacher-student distillation, keyframe control, and a new 130K video dataset.
-
$Z^2$-Sampling: Zero-Cost Zigzag Trajectories for Semantic Alignment in Diffusion Models
Z²-Sampling implicitly realizes zero-cost zigzag trajectories for curvature-aware semantic alignment in diffusion models by reducing multi-step paths via operator dualities and temporal caching while synthesizing a directional derivative penalty.
-
AttentionBender: Manipulating Cross-Attention in Video Diffusion Transformers as a Creative Probe
AttentionBender applies 2D transforms to cross-attention maps in video diffusion transformers, producing distributed distortions and glitch aesthetics that reveal entangled attention mechanisms while serving as both an XAI probe and creative tool.
-
Speculative Decoding for Autoregressive Video Generation
A training-free speculative decoding method for block-based autoregressive video diffusion uses a quality router on worst-frame ImageReward scores to accept drafter proposals, achieving up to 2.09x speedup at 95.7% quality retention.
-
MoZoo:Unleashing Video Diffusion power in animal fur and muscle simulation
MoZoo generates high-fidelity animal videos with fur and muscle dynamics from coarse meshes by extending video diffusion with role-aware RoPE and asymmetric decoupled attention, trained on a new synthetic-to-real dataset.
-
Score Shocks: The Burgers Equation Structure of Diffusion Generative Models
The score in diffusion models obeys viscous Burgers dynamics, with binary mode boundaries producing a universal tanh interfacial profile whose sharpening marks speciation transitions.
-
Physics-Aware Video Instance Removal Benchmark
The PVIR benchmark tests video object removal on physical consistency using 95 annotated videos and shows that existing methods struggle with complex interactions like lingering shadows.
-
Pixel Cube: Diffusion-based Portrait Video Relighting Through Realistic Lighting Reproduction
Diffusion model trained on hybrid real-rendered portrait videos uses HDR environment maps and background images for photorealistic, temporally consistent video relighting.
-
MORPHOS: Autoregressive 4D Generation with Temporal Structured Latents
MORPHOS introduces an autoregressive 4D generation method with Temporal Structured Latents (T-SLAT) that produces dynamic 3D assets from videos while handling topological changes and long sequences.
-
BlockGen: Flexible Blockwise Sequence Modeling with Hybrid Samplers
BlockGen enables flexible blockwise diffusion modeling with mixed block sizes and ARPC sampling, finding uniform diffusion outperforms masked under ancestral sampling in few-step regimes while the gap reverses with ARPC at high NFE.
-
A Systematic Study of Behavioral Cloning for Scientific Data Annotation
Introduces 9 synthetic annotation tasks and benchmarks for behavioral cloning, finding hierarchical skill learning, scaling benefits, effective multi-task pretraining, and shared internal representations of task phases and mistakes.
-
Bridging Brain and Semantics: A Hierarchical Framework for Semantically Enhanced fMRI-to-Video Reconstruction
CineNeuron improves fMRI-to-video reconstruction by combining bottom-up semantic enrichment with top-down Mixture-of-Memories integration and outperforms prior methods on benchmarks.
-
UniVidX: A Unified Multimodal Framework for Versatile Video Generation via Diffusion Priors
UniVidX unifies diverse video generation tasks into one conditional diffusion model using stochastic condition masking, decoupled gated LoRAs, and cross-modal self-attention.
-
DynamicRad: Content-Adaptive Sparse Attention for Long Video Diffusion
DynamicRad achieves 1.7x-2.5x inference speedups in long video diffusion with over 80% sparsity by grounding adaptive selection in a radial locality prior, using dual-mode static/dynamic strategies and offline BO with a semantic motion router.
-
Representations Before Pixels: Semantics-Guided Hierarchical Video Prediction
Re2Pix decomposes video prediction into semantic feature forecasting followed by representation-conditioned diffusion synthesis, with nested dropout and mixed supervision to handle prediction errors.
-
3D Scene-Adaptive Trajectory-Controllable Human Image Animation with Camera Movement
Presents a scene-adaptive 3D human image animation framework using ground-adaptive motion retargeting and viewpoint-adaptive latent fusion to control human and camera trajectories, claiming improvements on two benchmarks.
-
BioVid: Autoregressive Video Generation with Biological Behavior Semantic Comprehension
BioVid is a data-driven autoregressive model using 2D-encode/3D-decode tokenization and causal Transformer with EOS termination that reproduces real action duration distributions (W1 distance 1.24 frames) on NTU RGB+D drinking clips, outperforming fixed-length baselines.
-
Improving Visual Representation Alignment Generation with GRPO
VRPO applies generative representation policy optimization to dynamically align diffusion features with pretrained visual encoders, claiming +1.8 FID gains and 2.3x faster training versus REPA.
-
DisagFusion: Asynchronous Pipeline Parallelism and Elastic Scheduling for Disaggregated Diffusion Serving
DisagFusion achieves 3.4x-20.5x higher throughput and 18.5x lower latency for diffusion serving via asynchronous pipeline parallelism and elastic hybrid scheduling on disaggregated hardware.
-
{\Phi}-Noise: Training-Free Temporal Video Conditioning via Phase-Based Noise Manipulation
Training-free motion conditioning for latent video diffusion by direct injection of low-frequency phase from a reference video into the diffusion noise.
-
DriveCtrl: Conditioned Sim-to-Real Driving Video Generation
DriveCtrl is a depth-conditioned controllable framework that generates realistic driving videos from simulation while preserving annotations and scene dynamics.
-
World Action Models: The Next Frontier in Embodied AI
The paper introduces World Action Models as a new paradigm unifying predictive world modeling with action generation in embodied foundation models and provides a taxonomy of existing approaches.
-
Watching Physics: the Generative Science of Matter and Motion
Generative video models recover physical quantities like surface strain from visible motion when coupled with experiments and simulations, but fail when internal variables dominate, defining a new Generative Science of Matter and Motion.
-
Discrete Meanflow Training Curriculum
A DMF curriculum initialized from pretrained flow models achieves one-step FID 3.36 on CIFAR-10 after only 2000 epochs by exploiting a discretized consistency property in the Meanflow objective.
-
EchoTorrent: Towards Swift, Sustained, and Streaming Multi-Modal Video Generation
EchoTorrent combines multi-teacher distillation, adaptive CFG calibration, hybrid long-tail forcing, and VAE decoder refinement to enable few-pass autoregressive streaming video generation with improved temporal consistency and audio-lip sync.
-
Collaborative Few-Step Distillation and Low-Bit Quantization for Wan2.2 Dual-Expert Video Diffusion Models
A co-designed few-step distillation and low-bit quantization pipeline for Wan2.2-T2V-A14B keeps quantized few-step performance close to or above the full-precision baseline at 8 and 20 steps.
-
Introduction to Stochastic Differential Equations for Generative Machine Learning: A Variational Perspective
An expository tutorial deriving the ELBO for SDE-based generative models and presenting diffusion, score, and flow matching as variational parameterizations illustrated on a 1D example.
- Physically Native World Models: A Hamiltonian Perspective on Generative World Modeling