Phenaki generates arbitrary-length videos from sequences of text prompts by tokenizing videos with causal temporal attention and generating tokens with a text-conditioned masked transformer, trained jointly on images and videos.
Title resolution pending
5 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
verdicts
UNVERDICTED 5roles
background 1polarities
background 1representative citing papers
A diffusion model for video generation extends image architectures with joint image-video training and improved conditional sampling, delivering first large-scale text-to-video results and state-of-the-art performance on video prediction and unconditional generation benchmarks.
Proposes cINN architecture for conditional image generation that by construction yields diverse sharp samples, demonstrated on MNIST digit generation and image colorization with latent space manipulation.
Latent-space hierarchical diffusion models with targeted error-correction techniques generate realistic videos exceeding 1000 frames while using less compute than prior pixel-space approaches.
A structured survey of audio bandwidth extension that organizes the transition from deterministic discriminative DNNs to generative approaches including GANs, diffusion models, and flow-based methods.
citing papers explorer
-
Phenaki: Variable Length Video Generation From Open Domain Textual Description
Phenaki generates arbitrary-length videos from sequences of text prompts by tokenizing videos with causal temporal attention and generating tokens with a text-conditioned masked transformer, trained jointly on images and videos.
-
Video Diffusion Models
A diffusion model for video generation extends image architectures with joint image-video training and improved conditional sampling, delivering first large-scale text-to-video results and state-of-the-art performance on video prediction and unconditional generation benchmarks.
-
Guided Image Generation with Conditional Invertible Neural Networks
Proposes cINN architecture for conditional image generation that by construction yields diverse sharp samples, demonstrated on MNIST digit generation and image colorization with latent space manipulation.
-
Latent Video Diffusion Models for High-Fidelity Long Video Generation
Latent-space hierarchical diffusion models with targeted error-correction techniques generate realistic videos exceeding 1000 frames while using less compute than prior pixel-space approaches.
-
A Survey of Advancing Audio Super-Resolution and Bandwidth Extension from Discriminative to Generative Models
A structured survey of audio bandwidth extension that organizes the transition from deterministic discriminative DNNs to generative approaches including GANs, diffusion models, and flow-based methods.