Latent Video Diffusion Models for High-Fidelity Long Video Generation
Pith reviewed 2026-05-15 04:23 UTC · model grok-4.3
The pith
Video diffusion models shift to a low-dimensional 3D latent space to generate realistic clips longer than 1000 frames with modest compute.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We introduce lightweight video diffusion models that operate in a low-dimensional 3D latent space and significantly outperform previous pixel-space video diffusion models under a limited computational budget. We propose hierarchical diffusion in the latent space to produce longer videos with more than one thousand frames. Conditional latent perturbation and unconditional guidance are added to mitigate accumulated errors during video length extension.
What carries the argument
Low-dimensional 3D latent space for the diffusion process, together with hierarchical diffusion, conditional latent perturbation, and unconditional guidance.
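As a rough illustration of why diffusion in the latent space is cheaper, the sketch below computes the latent tensor shape and the resulting compression ratio. The downsampling factors of 8 (spatial) and 4 (temporal) come from the paper passage quoted later on this page; the clip size, the latent channel count of 4, and the function names are assumptions made for illustration, not values taken from the paper.

```python
from math import prod


def latent_shape(frames, height, width, latent_channels=4,
                 spatial_factor=8, temporal_factor=4):
    """Return the (T, C, H, W) latent shape for a clip.

    latent_channels=4 is an assumed value; the factors 8 and 4 follow the
    quoted passage. Dimensions must divide evenly by the factors.
    """
    if frames % temporal_factor or height % spatial_factor or width % spatial_factor:
        raise ValueError("clip dimensions must be divisible by the downsampling factors")
    return (frames // temporal_factor, latent_channels,
            height // spatial_factor, width // spatial_factor)


def compression_ratio(frames, height, width, rgb_channels=3, **kwargs):
    """Number of pixel values divided by the number of latent values."""
    pixels = frames * rgb_channels * height * width
    return pixels / prod(latent_shape(frames, height, width, **kwargs))


# Hypothetical 16-frame, 256x256 RGB clip.
shape = latent_shape(16, 256, 256)
ratio = compression_ratio(16, 256, 256)
print(shape, ratio)
```

Under these assumed values the denoiser sees 192 times fewer values per clip than a pixel-space model would, which is the kind of saving the efficiency claim depends on.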
If this is right
- Videos exceeding 1000 frames become feasible without proportional growth in required computation.
- Output realism exceeds that of prior pixel-space diffusion models when compute is constrained.
- Conditional latent perturbation and unconditional guidance reduce error buildup over extended sequences.
- The framework scales to large-scale text-to-video tasks while preserving the efficiency gains.
- Results hold across small domain-specific datasets of varied categories.
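The two error-mitigation ideas in the list above can be sketched in a few lines. This is a minimal sketch under stated assumptions: conditional latent perturbation is modelled as lightly noising the conditioning latents with a standard diffusion forward step, and unconditional guidance is modelled in the classifier-free guidance form. The function names, the `alpha_bar` parameter, and the guidance scale are hypothetical; the paper's exact formulation may differ.

```python
import math
import random


def perturb_condition(z_cond, alpha_bar, rng):
    """Assumed form of conditional latent perturbation.

    Returns sqrt(alpha_bar) * z + sqrt(1 - alpha_bar) * eps for each value,
    so alpha_bar = 1.0 leaves the conditioning latents untouched.
    """
    if not 0.0 < alpha_bar <= 1.0:
        raise ValueError("alpha_bar must lie in (0, 1]")
    scale, sigma = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [scale * z + sigma * rng.gauss(0.0, 1.0) for z in z_cond]


def guided_noise(eps_cond, eps_uncond, guidance_scale):
    """Assumed classifier-free-guidance-style blend of two noise predictions.

    guidance_scale = 1.0 returns the conditional prediction unchanged;
    values above 1.0 push further away from the unconditional prediction.
    """
    return [u + guidance_scale * (c - u) for c, u in zip(eps_cond, eps_uncond)]


rng = random.Random(0)
clean = [0.5, -1.0, 2.0]
unchanged = perturb_condition(clean, alpha_bar=1.0, rng=rng)
noisy = perturb_condition(clean, alpha_bar=0.9, rng=rng)
blended = guided_noise([1.0, 2.0], [0.0, 0.0], guidance_scale=1.5)
```

Training on perturbed conditions is one plausible way to make the model tolerant of the imperfect frames it will be conditioned on at sampling time, which is where accumulated error enters.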
Where Pith is reading between the lines
- The same latent compression and hierarchy might enable real-time or on-device video synthesis on consumer hardware.
- Hierarchical latent diffusion could transfer to related tasks such as long audio generation or sequential image synthesis.
- Future checks could verify whether fine motion details survive repeated latent compression and extension steps.
- Pairing the approach with existing video codecs might push feasible sequence lengths even further.
Load-bearing premise
The compressed 3D latent space retains enough spatial-temporal information to allow high-fidelity video generation without irreversible detail loss.
What would settle it
Train the model on a held-out dataset, generate sequences exceeding 1000 frames, and measure whether visual artifacts or temporal inconsistencies appear that are absent in equivalent pixel-space diffusion runs at higher compute cost.
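One crude way to look for temporal inconsistencies in such a long rollout is to compare frame-to-frame change near the end of the sequence with change near the start. The sketch below is only a proxy under stated assumptions: frames are flat lists of pixel values, the `drift_ratio` name and the windowing scheme are hypothetical, and the measure is not a substitute for established metrics such as FVD.

```python
from statistics import mean


def frame_change(frames):
    """Mean absolute difference between each pair of consecutive frames."""
    return [mean(abs(a - b) for a, b in zip(prev, cur))
            for prev, cur in zip(frames, frames[1:])]


def drift_ratio(frames, window):
    """Average change over the last `window` transitions divided by the first.

    Values far from 1.0 suggest the motion statistics drift over the rollout,
    for example through flicker (ratio rises) or freezing (ratio falls).
    """
    changes = frame_change(frames)
    if window <= 0 or len(changes) < 2 * window:
        raise ValueError("need at least 2 * window frame transitions")
    head = mean(changes[:window])
    if head == 0:
        raise ValueError("the opening window shows no motion; ratio is undefined")
    return mean(changes[-window:]) / head


# Toy "video": each frame is a flat list of pixel values that moves steadily.
steady = [[float(t), float(t)] for t in range(8)]
ratio = drift_ratio(steady, window=3)
```

Running the same measurement on latent-space and pixel-space rollouts of equal length would give one comparable number per model, though any conclusion would still need the perceptual metrics the referee asks for.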
Original abstract
AI-generated content has attracted lots of attention recently, but photo-realistic video synthesis is still challenging. Although many attempts using GANs and autoregressive models have been made in this area, the visual quality and length of generated videos are far from satisfactory. Diffusion models have shown remarkable results recently but require significant computational resources. To address this, we introduce lightweight video diffusion models by leveraging a low-dimensional 3D latent space, significantly outperforming previous pixel-space video diffusion models under a limited computational budget. In addition, we propose hierarchical diffusion in the latent space such that longer videos with more than one thousand frames can be produced. To further overcome the performance degradation issue for long video generation, we propose conditional latent perturbation and unconditional guidance that effectively mitigate the accumulated errors during the extension of video length. Extensive experiments on small domain datasets of different categories suggest that our framework generates more realistic and longer videos than previous strong baselines. We additionally provide an extension to large-scale text-to-video generation to demonstrate the superiority of our work. Our code and models will be made publicly available.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes latent video diffusion models operating in a low-dimensional 3D latent space to enable lightweight, high-fidelity video generation that outperforms pixel-space baselines under limited compute. It introduces hierarchical diffusion to produce videos exceeding 1000 frames and conditional latent perturbation plus unconditional guidance to mitigate error accumulation during length extension. Claims are supported by qualitative results on small-domain datasets across categories plus a text-to-video extension.
Significance. If the central claims hold under rigorous evaluation, the work would advance efficient generative video modeling by showing how latent-space diffusion can reduce computational cost while scaling to long sequences, addressing key bottlenecks in current video diffusion approaches.
Major comments (3)
- [Experiments] Experiments section: the central claim of outperforming prior pixel-space video diffusion models rests on qualitative comparisons and 'extensive experiments' on small-domain datasets, but the manuscript supplies no quantitative metrics (e.g., FVD, FID, PSNR), error bars, ablation tables, or explicit baseline specifications, leaving the outperformance assertion only partially supported.
- [§3.1] §3.1 (Video Autoencoder and latent space): the low-dimensional 3D latent representation is load-bearing for both efficiency and fidelity claims, yet no reconstruction metrics, latent-dimension ablations, or spatio-temporal detail preservation analysis are reported; without these, it is unclear whether critical high-frequency or temporal information is retained.
- [§4.3] §4.3 (Long-video extension): conditional latent perturbation and unconditional guidance are presented as solutions to accumulated errors, but the section provides no quantitative tracking of error growth, ablation isolating each component, or metrics comparing guided vs. unguided long sequences, weakening the mitigation claim.
Minor comments (2)
- [§3.1] Clarify the exact architecture and training details of the 3D autoencoder (e.g., compression ratio, loss terms) in the main text rather than deferring entirely to supplementary material.
- [Figures] Figure captions and legends should explicitly state dataset, resolution, and number of frames for each qualitative example to aid reproducibility.
Simulated Author's Rebuttal
We thank the referee for their thorough and constructive review. We address each major comment point-by-point below, clarifying our current results and outlining specific revisions that will strengthen the quantitative support for our claims.
- Referee: [Experiments] Experiments section: the central claim of outperforming prior pixel-space video diffusion models rests on qualitative comparisons and 'extensive experiments' on small-domain datasets, but the manuscript supplies no quantitative metrics (e.g., FVD, FID, PSNR), error bars, ablation tables, or explicit baseline specifications, leaving the outperformance assertion only partially supported.
  Authors: We agree that quantitative metrics would provide stronger evidence. In the revised manuscript we will add FVD and FID scores computed on the generated videos, include error bars from multiple random seeds, provide explicit ablation tables, and clearly document the baseline implementations together with their compute budgets to enable direct comparison.
  Revision: yes
- Referee: [§3.1] §3.1 (Video Autoencoder and latent space): the low-dimensional 3D latent representation is load-bearing for both efficiency and fidelity claims, yet no reconstruction metrics, latent-dimension ablations, or spatio-temporal detail preservation analysis are reported; without these, it is unclear whether critical high-frequency or temporal information is retained.
  Authors: We acknowledge that additional validation of the latent space is warranted. The revised manuscript will report reconstruction metrics (PSNR, SSIM) for the 3D video autoencoder, include ablations across latent dimensions, and provide both quantitative and qualitative analysis confirming preservation of high-frequency spatial and temporal details.
  Revision: yes
- Referee: [§4.3] §4.3 (Long-video extension): conditional latent perturbation and unconditional guidance are presented as solutions to accumulated errors, but the section provides no quantitative tracking of error growth, ablation isolating each component, or metrics comparing guided vs. unguided long sequences, weakening the mitigation claim.
  Authors: We agree that quantitative evidence for the error-mitigation techniques would strengthen the section. We will add plots tracking error growth over video length, ablations that isolate conditional latent perturbation and unconditional guidance, and direct metric comparisons between guided and unguided long-sequence generation.
  Revision: yes
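The reconstruction check promised in the second response can be illustrated with PSNR, the simpler of the two metrics named there. The sketch below is a minimal version under stated assumptions: signals are flat lists of values scaled to [0, 1], and the sample values are invented for illustration rather than taken from any experiment.

```python
import math


def psnr(reference, reconstruction, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two equally sized flat signals."""
    if len(reference) != len(reconstruction) or not reference:
        raise ValueError("inputs must be non-empty and the same length")
    mse = sum((r - x) ** 2 for r, x in zip(reference, reconstruction)) / len(reference)
    if mse == 0:
        return math.inf  # identical signals: no reconstruction error at all
    return 10.0 * math.log10(max_value ** 2 / mse)


# Invented values standing in for a frame and its autoencoder reconstruction.
original = [0.0, 0.5, 1.0, 0.5]
decoded = [0.1, 0.5, 0.9, 0.5]
score = psnr(original, decoded)
```

Reporting this per frame index, rather than as one average, would also show whether reconstruction quality varies along the temporal axis.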
Circularity Check
Derivation chain is self-contained with no reductions to fitted inputs or self-citations
The paper extends standard diffusion models by introducing a low-dimensional 3D latent space via a video autoencoder, hierarchical diffusion for long sequences, and conditional latent perturbation plus unconditional guidance to mitigate error accumulation. These components are described as new additions with explicit training and sampling procedures. No equations in the provided abstract or description reduce performance claims to quantities defined solely by parameters fitted inside the paper, nor do any load-bearing steps rely on self-citations that themselves reduce to unverified assumptions. The central claims rest on standard diffusion mechanics plus independently motivated architectural extensions, making the derivation self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
Axioms (2)
- domain assumption: Video data can be compressed into a low-dimensional 3D latent space with little enough loss that high-fidelity reconstruction remains possible after diffusion sampling.
- domain assumption: Hierarchical diffusion in latent space, plus the two proposed correction mechanisms, can prevent error accumulation over sequences longer than 1000 frames.
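To make the second assumption concrete, the sketch below plans a two-level generation order: a sparse first stage places keyframes, and a second stage fills the gaps between them. This is an assumed reading of "hierarchical diffusion", and the `plan_hierarchy` name, the stride of 16, and the 1024-frame target are hypothetical choices for illustration.

```python
def plan_hierarchy(total_frames, stride):
    """Split frame positions into sparse keyframes and in-between frames.

    Assumed two-level scheme: keyframes are generated first (where errors
    can accumulate step by step), then the gaps are interpolated.
    """
    if total_frames < 2 or stride < 1:
        raise ValueError("need at least two frames and a positive stride")
    keys = list(range(0, total_frames, stride))
    if keys[-1] != total_frames - 1:
        keys.append(total_frames - 1)  # close the final gap so interpolation has both ends
    key_set = set(keys)
    fill = [i for i in range(total_frames) if i not in key_set]
    return keys, fill


keys, fill = plan_hierarchy(total_frames=1024, stride=16)
print(len(keys), len(fill))
```

With these assumed settings the sequential stage covers 65 positions instead of 1024, so there are far fewer steps across which errors can compound; whether the interpolation stage preserves fine motion is a separate question.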
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
  unclear: Relation between the paper passage and the cited Recognition theorem.
  Passage: We compress videos using a lightweight 3D autoencoder... spatial and temporal downsampling factors of 8 and 4... hierarchical latent video diffusion models... conditional latent perturbation and unconditional guidance
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
  unclear: Relation between the paper passage and the cited Recognition theorem.
  Passage: We propose to perform diffusion and denoising on the video latent space... $L_{\text{simple}}(\theta) := \lVert \epsilon_\theta(z_t, t) - \epsilon \rVert_2^2$
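The noise-prediction objective quoted in the second passage can be sketched for a single sample as follows. The loss itself follows the quoted formula; the forward-noising step, the `alpha_bars` schedule values, and the stand-in predictors are assumptions in the usual DDPM style, added only so the example runs on its own.

```python
import math
import random


def add_noise(z0, eps, alpha_bar):
    """Assumed forward step: z_t = sqrt(alpha_bar) * z_0 + sqrt(1 - alpha_bar) * eps."""
    a, s = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [a * z + s * e for z, e in zip(z0, eps)]


def simple_loss(eps_model, z0, t, alpha_bars, rng):
    """Squared L2 distance between predicted and true noise for one sample."""
    eps = [rng.gauss(0.0, 1.0) for _ in z0]
    z_t = add_noise(z0, eps, alpha_bars[t])
    predicted = eps_model(z_t, t)
    return sum((p - e) ** 2 for p, e in zip(predicted, eps))


alpha_bars = [0.99, 0.9, 0.5, 0.1]  # hypothetical cumulative noise schedule
z0 = [0.3, -0.7, 1.2]               # hypothetical flattened video latent


def oracle(z_t, t):
    """Stand-in predictor that inverts add_noise using the known z0."""
    a, s = math.sqrt(alpha_bars[t]), math.sqrt(1.0 - alpha_bars[t])
    return [(zt - a * z) / s for zt, z in zip(z_t, z0)]


loss_oracle = simple_loss(oracle, z0, t=2, alpha_bars=alpha_bars, rng=random.Random(0))
loss_zero = simple_loss(lambda z_t, t: [0.0] * len(z_t), z0, t=2,
                        alpha_bars=alpha_bars, rng=random.Random(0))
```

The oracle recovers the sampled noise up to rounding, so its loss is essentially zero, while a predictor that always outputs zeros pays the full squared norm of the noise. A trained network sits somewhere between the two.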
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 41 Pith papers
- TeDiO: Temporal Diagonal Optimization for Training-Free Coherent Video Diffusion
  TeDiO regularizes temporal diagonals in diffusion transformer attention maps to produce smoother video motion while keeping per-frame quality intact.
- GaitProtector: Impersonation-Driven Gait De-Identification via Training-Free Diffusion Latent Optimization
  GaitProtector optimizes diffusion model latents to impersonate target identities in gait sequences, dropping Rank-1 identification accuracy from 89.6% to 15.0% on CASIA-B while keeping scoliosis diagnostic accuracy at 74.2%.
- DCR: Counterfactual Attractor Guidance for Rare Compositional Generation
  DCR uses a counterfactual attractor and projection-based repulsion to suppress default completion bias in diffusion models, improving fidelity for rare compositional prompts while preserving quality.
- Efficient Video Diffusion Models: Advancements and Challenges
  A survey that groups efficient video diffusion methods into four paradigms (step distillation, efficient attention, model compression, and cache/trajectory optimization) and outlines open challenges for practical use.
- Not All Frames Deserve Full Computation: Accelerating Autoregressive Video Generation via Selective Computation and Predictive Extrapolation
  SCOPE accelerates autoregressive video diffusion up to 4.73x by using a tri-modal cache-predict-recompute scheduler with Taylor extrapolation and selective active-frame computation while preserving output quality.
- ChopGrad: Pixel-Wise Losses for Latent Video Diffusion via Truncated Backpropagation
  ChopGrad truncates backpropagation to local frame windows in video diffusion models, reducing memory from linear in frame count to constant while enabling pixel-wise loss fine-tuning.
- FrameDiT: Diffusion Transformer with Matrix Attention for Efficient Video Generation
  FrameDiT proposes Matrix Attention for DiTs to achieve SOTA video generation with improved temporal coherence and efficiency comparable to local factorized attention.
- EduVQA: Towards Concept-Aware Assessment of Educational AI-Generated Videos
  EduVQA introduces the first concept-aware benchmark for educational AI-generated video assessment and a S2D-MoE framework that jointly evaluates perceptual quality and fine-grained semantic alignment.
- One-to-All Animation: Alignment-Free Character Animation and Image Pose Transfer
  One-to-All Animation enables alignment-free character animation and image pose transfer via self-supervised outpainting reformulation, reference extraction, hybrid fusion attention, identity-robust pose control, and t...
- History-Guided Video Diffusion
  DFoT enables flexible history conditioning in video diffusion, with history guidance methods that boost temporal consistency and support long rollouts.
- Learning Interactive Real-World Simulators
  UniSim learns a universal real-world simulator from orchestrated diverse datasets, enabling zero-shot deployment of policies trained purely in simulation.
- RoboEvolve: Co-Evolving Planner-Simulator for Robotic Manipulation with Limited Data
  A co-evolutionary VLM-VGM loop on 500 unlabeled images raises planner success by 30 points and simulator success by 48 percent while beating fully supervised baselines.
- SWIFT: Prompt-Adaptive Memory for Efficient Interactive Long Video Generation
  SWIFT introduces a semantic injection cache with head-wise updates and an adaptive dynamic window plus segment anchors to achieve efficient multi-prompt long video generation at 22.6 FPS while preserving quality in ca...
- DiffATS: Diffusion in Aligned Tensor Space
  DiffATS trains diffusion models directly on aligned Tucker tensor primitives that are proven to be homeomorphisms, delivering efficient unconditional and conditional generation across images, videos, and PDE data with...
- Mutual Forcing: Dual-Mode Self-Evolution for Fast Autoregressive Audio-Video Character Generation
  Mutual Forcing trains a single native autoregressive audio-video model with mutually reinforcing few-step and multi-step modes via self-distillation to match 50-step baselines at 4-8 steps.
- Ego-InBetween: Generating Object State Transitions in Ego-Centric Videos
  EgoIn uses a fine-tuned vision-language model to infer transition steps and a conditioning module plus auxiliary supervision to generate coherent egocentric video sequences of object state changes.
- Rein3D: Reinforced 3D Indoor Scene Generation with Panoramic Video Diffusion Models
  Rein3D generates photorealistic, globally consistent 3D indoor scenes by using a restore-and-refine process where radial panoramic videos are restored via diffusion models and then used to update a 3D Gaussian field.
- Latent-Compressed Variational Autoencoder for Video Diffusion Models
  A frequency-based latent compression method for video VAEs yields higher reconstruction quality than channel-reduction baselines at fixed compression ratios.
- Causal Forcing: Autoregressive Diffusion Distillation Done Right for High-Quality Real-Time Interactive Video Generation
  Causal Forcing initializes autoregressive diffusion students from AR teachers to recover flow maps that bidirectional teachers cannot provide, delivering 19%+ gains over Self Forcing on dynamic degree and related metrics.
- Causal Forcing: Autoregressive Diffusion Distillation Done Right for High-Quality Real-Time Interactive Video Generation
  Causal Forcing uses an autoregressive teacher for ODE initialization in diffusion distillation to close the causal attention gap and deliver better real-time video generation than Self Forcing.
- Reward Forcing: Efficient Streaming Video Generation with Rewarded Distribution Matching Distillation
  Reward Forcing combines EMA-Sink tokens and Rewarded Distribution Matching Distillation to deliver state-of-the-art streaming video generation at 23.1 FPS without copying initial frames.
- Self-Forcing++: Towards Minute-Scale High-Quality Video Generation
  Self-Forcing++ scales autoregressive video diffusion to over 4 minutes by using self-generated segments for guidance, reducing error accumulation and outperforming baselines in fidelity and consistency.
- LongLive: Real-time Interactive Long Video Generation
  LongLive is a causal autoregressive video generator that produces up to 240-second interactive videos at 20.7 FPS on one H100 GPU after 32 GPU-days of fine-tuning from a 1.3B short-clip model.
- ReSim: Reliable World Simulation for Autonomous Driving
  ReSim is a controllable video world model trained on heterogeneous real and simulated driving data that achieves higher fidelity and controllability for both expert and non-expert actions, plus a Video2Reward module f...
- We'll Fix it in Post: Improving Text-to-Video Generation with Neuro-Symbolic Feedback
  NeuS-E is a post-generation refinement method that uses neuro-symbolic analysis of a formal video representation to detect and correct semantic and temporal inconsistencies in text-to-video outputs, improving prompt a...
- VBench-2.0: Advancing Video Generation Benchmark Suite for Intrinsic Faithfulness
  VBench-2.0 is a benchmark suite that automatically evaluates video generative models on five dimensions of intrinsic faithfulness: Human Fidelity, Controllability, Creativity, Physics, and Commonsense using VLMs, LLMs...
- Long-Context Autoregressive Video Modeling with Next-Frame Prediction
  FAR baseline plus asymmetric kernels for long short-term context modeling achieves SOTA short and long video generation in autoregressive setups.
- CameraCtrl: Enabling Camera Control for Text-to-Video Generation
  CameraCtrl enables accurate camera pose control in video diffusion models through a trained plug-and-play module and dataset choices emphasizing diverse camera trajectories with matching appearance.
- Latte: Latent Diffusion Transformer for Video Generation
  Latte achieves state-of-the-art video generation on FaceForensics, SkyTimelapse, UCF101, and Taichi-HD by using a latent diffusion transformer with four efficient spatial-temporal decomposition variants and best-pract...
- VideoPoet: A Large Language Model for Zero-Shot Video Generation
  VideoPoet is a large language model that performs zero-shot video generation with audio from diverse multimodal conditioning signals.
- VideoCrafter1: Open Diffusion Models for High-Quality Video Generation
  Open-source text-to-video and image-to-video diffusion models generate high-quality 1024x576 videos, with the I2V variant claimed as the first to strictly preserve reference image content.
- DragNUWA: Fine-grained Control in Video Generation by Integrating Text, Image, and Trajectory
  DragNUWA integrates text, image, and trajectory controls into a diffusion video model using a Trajectory Sampler, Multiscale Fusion, and Adaptive Training to enable fine-grained open-domain video generation.
- SWoMo: Neuro-Symbolic World Model for Cataract Surgery Simulation
  SWoMo decouples symbolic rule-based motion modeling from diffusion-based visual realism using inverse pairing of reconstructed real videos to enable sim-to-real translation and generalization in cataract surgery simulations.
- SWoMo: Neuro-Symbolic World Model for Cataract Surgery Simulation
  SWoMo decouples symbolic rule-based motion modeling via scene graphs from visual realism via diffusion models, trained through inverse pairing of real cataract surgery videos reconstructed in the simulator for sim-to-...
- DepthPilot: From Controllability to Interpretability in Colonoscopy Video Generation
  DepthPilot generates physically consistent and clinically interpretable colonoscopy videos by injecting depth priors into diffusion models through parameter-efficient fine-tuning and replacing linear denoising weights...
- Not all tokens contribute equally to diffusion learning
  DARE mitigates neglect of important tokens in conditional diffusion models via distribution-rectified guidance and spatial attention alignment.
- DriVerse: Navigation World Model for Driving Simulation via Multimodal Trajectory Prompting and Motion Alignment
  DriVerse is a generative model that simulates driving scenes from an image and trajectory using multimodal prompting and motion alignment, achieving better performance on nuScenes and Waymo datasets with minimal training.
- Empowering Video Translation using Multimodal Large Language Models
  The paper offers the first focused review of MLLM-based video translation organized by a three-role taxonomy of Semantic Reasoner, Expressive Performer, and Visual Synthesizer, plus open challenges.
- Character-Centered Dialogue Generation from Scene-Level Prompts
  A training-free framework generates expressive, character-grounded dialogue and speech from scene prompts using vision-language encoders, LLMs, and a recursive narrative memory bank for cross-scene consistency.
- Scene-Action Prompt Fusion for Coherent Text-to-Video Storytelling
  A prompt fusion approach combines bidirectional time-weighted latent blending, dynamics-informed prompt weighting via CLIP, and semantic action representations to produce temporally consistent long videos from text wi...
- Evolution of Video Generative Foundations
  This survey traces video generation technology from GANs to diffusion models and then to autoregressive and multimodal approaches while analyzing principles, strengths, and future trends.
Reference graph
Works this paper leans on
-
[1]
Large scale GAN training for high fidelity natural image synthesis
Andrew Brock, Jeff Donahue, and Karen Simonyan. Large scale GAN training for high fidelity natural image synthesis. In ICLR, 2019. 1
work page 2019
-
[2]
Tim Brooks, Janne Hellsten, Miika Aittala, Ting-Chun Wang, Timo Aila, Jaakko Lehtinen, Ming-Yu Liu, Alexei A Efros, and Tero Karras. Generating long videos of dynamic scenes. arXiv preprint arXiv:2206.03429, 2022. 1, 6
-
[3]
Hier- archical video generation for complex data
Lluis Castrejon, Nicolas Ballas, and Aaron Courville. Hier- archical video generation for complex data. arXiv preprint arXiv:2106.02719, 2021. 5
-
[4]
Diffusion models beat gans on image synthesis
Prafulla Dhariwal and Alexander Nichol. Diffusion models beat gans on image synthesis. Advances in Neural Informa- tion Processing Systems, 34:8780–8794, 2021. 1, 3, 5
work page 2021
-
[5]
Taming transformers for high-resolution image synthesis
Patrick Esser, Robin Rombach, and Bjorn Ommer. Taming transformers for high-resolution image synthesis. In Pro- ceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 12873–12883, 2021. 1, 3, 5
work page 2021
-
[6]
Songwei Ge, Thomas Hayes, Harry Yang, Xi Yin, Guan Pang, David Jacobs, Jia-Bin Huang, and Devi Parikh. Long video generation with time-agnostic vqgan and time- sensitive transformer. arXiv preprint arXiv:2204.03638 ,
-
[7]
Probabilistic video generation using holis- tic attribute control
Jiawei He, Andreas Lehrmann, Joseph Marino, Greg Mori, and Leonid Sigal. Probabilistic video generation using holis- tic attribute control. In Proceedings of the European Confer- ence on Computer Vision (ECCV), pages 452–467, 2018. 1, 3
work page 2018
-
[8]
Imagen Video: High Definition Video Generation with Diffusion Models
Jonathan Ho, William Chan, Chitwan Saharia, Jay Whang, Ruiqi Gao, Alexey Gritsenko, Diederik P Kingma, Ben Poole, Mohammad Norouzi, David J Fleet, et al. Imagen video: High definition video generation with diffusion mod- els. arXiv preprint arXiv:2210.02303, 2022. 1, 4
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[9]
Denoising diffu- sion probabilistic models
Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffu- sion probabilistic models. Advances in Neural Information Processing Systems, 33:6840–6851, 2020. 1, 3, 5
work page 2020
-
[10]
Cascaded diffusion models for high fidelity image generation
Jonathan Ho, Chitwan Saharia, William Chan, David J Fleet, Mohammad Norouzi, and Tim Salimans. Cascaded diffusion models for high fidelity image generation. J. Mach. Learn. Res., 23:47–1, 2022. 3, 6
work page 2022
-
[11]
Classifier-Free Diffusion Guidance
Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598, 2022. 6
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[12]
Jonathan Ho, Tim Salimans, Alexey Gritsenko, William Chan, Mohammad Norouzi, and David J Fleet. Video dif- fusion models. arXiv preprint arXiv:2204.03458, 2022. 1, 4, 5, 6, 7
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[13]
CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers
Wenyi Hong, Ming Ding, Wendi Zheng, Xinghan Liu, and Jie Tang. Cogvideo: Large-scale pretraining for text-to-video generation via transformers. arXiv preprint arXiv:2205.15868, 2022. 5
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[14]
Alias-free generative adversarial networks
Tero Karras, Miika Aittala, Samuli Laine, Erik H ¨ark¨onen, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. Alias-free generative adversarial networks. In NeurIPS, 2021. 1
work page 2021
-
[15]
A style-based generator architecture for generative adversarial networks
Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. In CVPR, 2019. 1
work page 2019
-
[16]
Analyzing and improving the image quality of StyleGAN
Tero Karras, Samuli Laine, Miika Aittala, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. Analyzing and improving the image quality of StyleGAN. In CVPR, 2020. 1
work page 2020
-
[17]
Melgan: Generative adversarial networks for conditional waveform synthesis.CVPR, 2019a
Manoj Kumar, Mohammad Babaeizadeh, Dumitru Er- han, Chelsea Finn, Sergey Levine, Laurent Dinh, and Durk Kingma. Videoflow: A conditional flow-based model for stochastic video generation. arXiv preprint arXiv:1903.01434, 2019. 1, 3
-
[18]
GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models
Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. Glide: Towards photorealistic image generation and editing with text-guided diffusion models.arXiv preprint arXiv:2112.10741, 2021. 1, 3
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[19]
Improved denoising diffusion probabilistic models
Alexander Quinn Nichol and Prafulla Dhariwal. Improved denoising diffusion probabilistic models. In International Conference on Machine Learning, pages 8162–8171. PMLR,
-
[20]
Neural Discrete Representation Learning
Aaron van den Oord, Oriol Vinyals, and Koray Kavukcuoglu. Neural discrete representation learning. arXiv preprint arXiv:1711.00937, 2017. 3
work page internal anchor Pith review Pith/arXiv arXiv 2017
-
[21]
Ruslan Rakhimov, Denis V olkhonskiy, Alexey Artemov, De- nis Zorin, and Evgeny Burnaev. Latent video transformer. arXiv preprint arXiv:2006.10704, 2020. 1
-
[22]
Hierarchical Text-Conditional Image Generation with CLIP Latents
Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image gen- eration with clip latents. arXiv preprint arXiv:2204.06125,
work page internal anchor Pith review Pith/arXiv arXiv
-
[23]
High-resolution image synthesis with latent diffusion models
Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Bj ¨orn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10684–10695, 2022. 1, 3, 4
work page 2022
-
[24]
U- net: Convolutional networks for biomedical image segmen- tation
Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U- net: Convolutional networks for biomedical image segmen- tation. In Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, October 5-9, 2015, Proceedings, Part III 18, pages 234–241. Springer, 2015. 5
work page 2015
-
[25]
Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding
Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily Denton, Seyed Kamyar Seyed Ghasemipour, Burcu Karagol Ayan, S Sara Mahdavi, Rapha Gontijo Lopes, et al. Photorealistic text-to-image diffusion models with deep language understanding. arXiv preprint arXiv:2205.11487, 2022. 1, 3
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[26]
Tempo- ral generative adversarial nets with singular value clipping
Masaki Saito, Eiichi Matsumoto, and Shunta Saito. Tempo- ral generative adversarial nets with singular value clipping. In Proceedings of the IEEE international conference on com- puter vision, pages 2830–2839, 2017. 1, 3
work page 2017
-
[27]
Masaki Saito, Shunta Saito, Masanori Koyama, and So- suke Kobayashi. Train sparsely, generate densely: Memory- efficient unsupervised training of high-resolution temporal gan. International Journal of Computer Vision , 128:2586– 2606, 2020. 1, 3, 6, 7
work page 2020
-
[28]
First order motion model for image animation
Aliaksandr Siarohin, St ´ephane Lathuili`ere, Sergey Tulyakov, Elisa Ricci, and Nicu Sebe. First order motion model for image animation. NeurIPS, 2019. 6
work page 2019
-
[29]
Make-A-Video: Text-to-Video Generation without Text-Video Data
Uriel Singer, Adam Polyak, Thomas Hayes, Xi Yin, Jie An, Songyang Zhang, Qiyuan Hu, Harry Yang, Oron Ashual, Oran Gafni, et al. Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:2209.14792 ,
work page internal anchor Pith review Pith/arXiv arXiv
-
[30]
Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2
Ivan Skorokhodov, Sergey Tulyakov, and Mohamed Elho- seiny. Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pat- tern Recognition, pages 3626–3636, 2022. 1, 3
work page 2022
-
[31]
Deep unsupervised learning using nonequilibrium thermodynamics
Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In International Confer- ence on Machine Learning, pages 2256–2265. PMLR, 2015. 3
work page 2015
-
[32]
Denoising Diffusion Implicit Models
Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502, 2020. 3
work page internal anchor Pith review Pith/arXiv arXiv 2010
-
[33]
Score-Based Generative Modeling through Stochastic Differential Equations
Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Ab- hishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equa- tions. arXiv preprint arXiv:2011.13456, 2020. 3
-
[34]
UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild
Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. UCF101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402, 2012. 3, 6
-
[35]
Yu Tian, Jian Ren, Menglei Chai, Kyle Olszewski, Xi Peng, Dimitris N. Metaxas, and Sergey Tulyakov. A good image generator is what you need for high-resolution video synthesis. In International Conference on Learning Representations, 2021. 1, 3, 6, 7
-
[36]
MoCoGAN: Decomposing motion and content for video generation
Sergey Tulyakov, Ming-Yu Liu, Xiaodong Yang, and Jan Kautz. MoCoGAN: Decomposing motion and content for video generation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1526–1535,
-
[37]
Towards Accurate Generative Models of Video: A New Metric & Challenges
Thomas Unterthiner, Sjoerd van Steenkiste, Karol Kurach, Raphael Marinier, Marcin Michalski, and Sylvain Gelly. Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:1812.01717, 2018. 6
-
[38]
Towards accurate generative models of video: A new metric & challenges
Thomas Unterthiner, Sjoerd van Steenkiste, Karol Kurach, Raphael Marinier, Marcin Michalski, and Sylvain Gelly. Towards accurate generative models of video: A new metric & challenges. ICLR, 2019. 6
-
[39]
Vikram Voleti, Alexia Jolicoeur-Martineau, and Christopher Pal. Masked conditional video diffusion for prediction, generation, and interpolation. arXiv preprint arXiv:2205.09853,
-
[40]
Generating videos with scene dynamics
Carl Vondrick, Hamed Pirsiavash, and Antonio Torralba. Generating videos with scene dynamics. Advances in Neural Information Processing Systems, 29, 2016. 1, 3
-
[41]
Jacob Walker, Ali Razavi, and Aäron van den Oord. Predicting video with VQVAE. arXiv preprint arXiv:2103.01950,
-
[42]
Scaling autoregressive video models
Dirk Weissenborn, Oscar Täckström, and Jakob Uszkoreit. Scaling autoregressive video models. arXiv preprint arXiv:1906.02634, 2019. 1
-
[43]
Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks
Wei Xiong, Wenhan Luo, Lin Ma, Wei Liu, and Jiebo Luo. Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018. 3, 6
-
[44]
VideoGPT: Video Generation using VQ-VAE and Transformers
Wilson Yan, Yunzhi Zhang, Pieter Abbeel, and Aravind Srinivas. VideoGPT: Video generation using VQ-VAE and transformers. arXiv preprint arXiv:2104.10157, 2021. 1, 3
-
[45]
Video probabilistic diffusion models in projected latent space
Sihyun Yu, Kihyuk Sohn, Subin Kim, and Jinwoo Shin. Video probabilistic diffusion models in projected latent space. arXiv preprint arXiv:2302.07685, 2023. 4
-
[46]
Generating videos with dynamics-aware implicit generative adversarial networks
Sihyun Yu, Jihoon Tack, Sangwoo Mo, Hyunsu Kim, Junho Kim, Jung-Woo Ha, and Jinwoo Shin. Generating videos with dynamics-aware implicit generative adversarial networks. In International Conference on Learning Representations, 2022. 3, 6, 7
-
[47]
The unreasonable effectiveness of deep features as a perceptual metric
Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In CVPR, 2018. 5
-
[48]
MagicVideo: Efficient Video Generation With Latent Diffusion Models
Daquan Zhou, Weimin Wang, Hanshu Yan, Weiwei Lv, Yizhe Zhu, and Jiashi Feng. MagicVideo: Efficient video generation with latent diffusion models. arXiv preprint arXiv:2211.11018, 2022. 1, 4