pith. sign in

arxiv: 2211.13221 · v2 · pith:U52T37GUnew · submitted 2022-11-23 · 💻 cs.CV · cs.AI

Latent Video Diffusion Models for High-Fidelity Long Video Generation

Pith reviewed 2026-05-15 04:23 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords video diffusionlatent spacelong video generationhierarchical diffusiontext-to-video3D latentconditional perturbation
0
0 comments X

The pith

Video diffusion models shift to a low-dimensional 3D latent space to generate realistic clips longer than 1000 frames with modest compute.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that shifting the diffusion process for video into a compressed three-dimensional latent space produces better-looking results than working directly on pixels while using far less computation. A hierarchical scheme in that space then allows the model to build videos longer than one thousand frames by generating them in stages. To stop quality from dropping as the sequence grows, the authors insert controlled noise into the latent representations and apply an unconditional guidance step that corrects accumulated mistakes. Tests on small specialized datasets confirm longer and more realistic output than earlier methods, with an additional demonstration on large-scale text-conditioned generation. Readers would care because practical video synthesis has been blocked by either short length or high hardware demands.

Core claim

We introduce lightweight video diffusion models by leveraging a low-dimensional 3D latent space, which significantly outperforms previous pixel-space video diffusion models under a limited computational budget. We propose hierarchical diffusion in the latent space to produce longer videos with more than one thousand frames. Conditional latent perturbation and unconditional guidance are added to mitigate accumulated errors during video length extension.

What carries the argument

Low-dimensional 3D latent space for the diffusion process, together with hierarchical diffusion, conditional latent perturbation, and unconditional guidance.

If this is right

  • Videos exceeding 1000 frames become feasible without proportional growth in required computation.
  • Output realism exceeds that of prior pixel-space diffusion models when compute is constrained.
  • Conditional latent perturbation and unconditional guidance reduce error buildup over extended sequences.
  • The framework scales to large-scale text-to-video tasks while preserving the efficiency gains.
  • Results hold across small domain-specific datasets of varied categories.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same latent compression and hierarchy might enable real-time or on-device video synthesis on consumer hardware.
  • Hierarchical latent diffusion could transfer to related tasks such as long audio generation or sequential image synthesis.
  • Future checks could verify whether fine motion details survive repeated latent compression and extension steps.
  • Pairing the approach with existing video codecs might push feasible sequence lengths even further.

Load-bearing premise

The compressed 3D latent space retains enough spatial-temporal information to allow high-fidelity video generation without irreversible detail loss.

What would settle it

Train the model on a held-out dataset, generate sequences exceeding 1000 frames, and measure whether visual artifacts or temporal inconsistencies appear that are absent in equivalent pixel-space diffusion runs at higher compute cost.

read the original abstract

AI-generated content has attracted lots of attention recently, but photo-realistic video synthesis is still challenging. Although many attempts using GANs and autoregressive models have been made in this area, the visual quality and length of generated videos are far from satisfactory. Diffusion models have shown remarkable results recently but require significant computational resources. To address this, we introduce lightweight video diffusion models by leveraging a low-dimensional 3D latent space, significantly outperforming previous pixel-space video diffusion models under a limited computational budget. In addition, we propose hierarchical diffusion in the latent space such that longer videos with more than one thousand frames can be produced. To further overcome the performance degradation issue for long video generation, we propose conditional latent perturbation and unconditional guidance that effectively mitigate the accumulated errors during the extension of video length. Extensive experiments on small domain datasets of different categories suggest that our framework generates more realistic and longer videos than previous strong baselines. We additionally provide an extension to large-scale text-to-video generation to demonstrate the superiority of our work. Our code and models will be made publicly available.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes latent video diffusion models operating in a low-dimensional 3D latent space to enable lightweight, high-fidelity video generation that outperforms pixel-space baselines under limited compute. It introduces hierarchical diffusion to produce videos exceeding 1000 frames and conditional latent perturbation plus unconditional guidance to mitigate error accumulation during length extension. Claims are supported by qualitative results on small-domain datasets across categories plus a text-to-video extension.

Significance. If the central claims hold under rigorous evaluation, the work would advance efficient generative video modeling by showing how latent-space diffusion can reduce computational cost while scaling to long sequences, addressing key bottlenecks in current video diffusion approaches.

major comments (3)
  1. [Experiments] Experiments section: the central claim of outperforming prior pixel-space video diffusion models rests on qualitative comparisons and 'extensive experiments' on small-domain datasets, but the manuscript supplies no quantitative metrics (e.g., FVD, FID, PSNR), error bars, ablation tables, or explicit baseline specifications, leaving the outperformance assertion only partially supported.
  2. [§3.1] §3.1 (Video Autoencoder and latent space): the low-dimensional 3D latent representation is load-bearing for both efficiency and fidelity claims, yet no reconstruction metrics, latent-dimension ablations, or spatio-temporal detail preservation analysis are reported; without these, it is unclear whether critical high-frequency or temporal information is retained.
  3. [§4.3] §4.3 (Long-video extension): conditional latent perturbation and unconditional guidance are presented as solutions to accumulated errors, but the section provides no quantitative tracking of error growth, ablation isolating each component, or metrics comparing guided vs. unguided long sequences, weakening the mitigation claim.
minor comments (2)
  1. [§3.1] Clarify the exact architecture and training details of the 3D autoencoder (e.g., compression ratio, loss terms) in the main text rather than deferring entirely to supplementary material.
  2. [Figures] Figure captions and legends should explicitly state dataset, resolution, and number of frames for each qualitative example to aid reproducibility.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their thorough and constructive review. We address each major comment point-by-point below, clarifying our current results and outlining specific revisions that will strengthen the quantitative support for our claims.

read point-by-point responses
  1. Referee: [Experiments] Experiments section: the central claim of outperforming prior pixel-space video diffusion models rests on qualitative comparisons and 'extensive experiments' on small-domain datasets, but the manuscript supplies no quantitative metrics (e.g., FVD, FID, PSNR), error bars, ablation tables, or explicit baseline specifications, leaving the outperformance assertion only partially supported.

    Authors: We agree that quantitative metrics would provide stronger evidence. In the revised manuscript we will add FVD and FID scores computed on the generated videos, include error bars from multiple random seeds, provide explicit ablation tables, and clearly document the baseline implementations together with their compute budgets to enable direct comparison. revision: yes

  2. Referee: [§3.1] §3.1 (Video Autoencoder and latent space): the low-dimensional 3D latent representation is load-bearing for both efficiency and fidelity claims, yet no reconstruction metrics, latent-dimension ablations, or spatio-temporal detail preservation analysis are reported; without these, it is unclear whether critical high-frequency or temporal information is retained.

    Authors: We acknowledge that additional validation of the latent space is warranted. The revised manuscript will report reconstruction metrics (PSNR, SSIM) for the 3D video autoencoder, include ablations across latent dimensions, and provide both quantitative and qualitative analysis confirming preservation of high-frequency spatial and temporal details. revision: yes

  3. Referee: [§4.3] §4.3 (Long-video extension): conditional latent perturbation and unconditional guidance are presented as solutions to accumulated errors, but the section provides no quantitative tracking of error growth, ablation isolating each component, or metrics comparing guided vs. unguided long sequences, weakening the mitigation claim.

    Authors: We agree that quantitative evidence for the error-mitigation techniques would strengthen the section. We will add plots tracking error growth over video length, ablations that isolate conditional latent perturbation and unconditional guidance, and direct metric comparisons between guided and unguided long-sequence generation. revision: yes

Circularity Check

0 steps flagged

Derivation chain is self-contained with no reductions to fitted inputs or self-citations

full rationale

The paper extends standard diffusion models by introducing a low-dimensional 3D latent space via a video autoencoder, hierarchical diffusion for long sequences, and conditional latent perturbation plus unconditional guidance to mitigate error accumulation. These components are described as new additions with explicit training and sampling procedures. No equations in the provided abstract or description reduce performance claims to quantities defined solely by parameters fitted inside the paper, nor do any load-bearing steps rely on self-citations that themselves reduce to unverified assumptions. The central claims rest on standard diffusion mechanics plus independently motivated architectural extensions, making the derivation self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the domain assumption that a learned low-dimensional 3D latent space retains enough information for high-fidelity video reconstruction and that standard diffusion assumptions (Gaussian forward process, learned reverse process) transfer directly to this compressed space.

axioms (2)
  • domain assumption Video data can be losslessly compressed into a low-dimensional 3D latent space that still supports high-fidelity reconstruction after diffusion sampling.
    Invoked in the first paragraph of the abstract as the basis for the lightweight model.
  • domain assumption Hierarchical diffusion in latent space plus the two proposed correction mechanisms can prevent error accumulation over sequences longer than 1000 frames.
    Central to the long-video claim but presented without derivation or proof in the abstract.

pith-pipeline@v0.9.0 · 5488 in / 1405 out tokens · 77998 ms · 2026-05-15T04:23:42.819005+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 41 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. TeDiO: Temporal Diagonal Optimization for Training-Free Coherent Video Diffusion

    cs.CV 2026-05 unverdicted novelty 7.0

    TeDiO regularizes temporal diagonals in diffusion transformer attention maps to produce smoother video motion while keeping per-frame quality intact.

  2. GaitProtector: Impersonation-Driven Gait De-Identification via Training-Free Diffusion Latent Optimization

    cs.CV 2026-05 unverdicted novelty 7.0

    GaitProtector optimizes diffusion model latents to impersonate target identities in gait sequences, dropping Rank-1 identification accuracy from 89.6% to 15.0% on CASIA-B while keeping scoliosis diagnostic accuracy at 74.2%.

  3. DCR: Counterfactual Attractor Guidance for Rare Compositional Generation

    cs.CV 2026-05 unverdicted novelty 7.0

    DCR uses a counterfactual attractor and projection-based repulsion to suppress default completion bias in diffusion models, improving fidelity for rare compositional prompts while preserving quality.

  4. Efficient Video Diffusion Models: Advancements and Challenges

    cs.CV 2026-04 unverdicted novelty 7.0

    A survey that groups efficient video diffusion methods into four paradigms—step distillation, efficient attention, model compression, and cache/trajectory optimization—and outlines open challenges for practical use.

  5. Not All Frames Deserve Full Computation: Accelerating Autoregressive Video Generation via Selective Computation and Predictive Extrapolation

    cs.CV 2026-04 conditional novelty 7.0

    SCOPE accelerates autoregressive video diffusion up to 4.73x by using a tri-modal cache-predict-recompute scheduler with Taylor extrapolation and selective active-frame computation while preserving output quality.

  6. ChopGrad: Pixel-Wise Losses for Latent Video Diffusion via Truncated Backpropagation

    cs.CV 2026-03 unverdicted novelty 7.0

    ChopGrad truncates backpropagation to local frame windows in video diffusion models, reducing memory from linear in frame count to constant while enabling pixel-wise loss fine-tuning.

  7. FrameDiT: Diffusion Transformer with Matrix Attention for Efficient Video Generation

    cs.CV 2026-03 unverdicted novelty 7.0

    FrameDiT proposes Matrix Attention for DiTs to achieve SOTA video generation with improved temporal coherence and efficiency comparable to local factorized attention.

  8. EduVQA: Towards Concept-Aware Assessment of Educational AI-Generated Videos

    cs.CV 2026-03 unverdicted novelty 7.0

    EduVQA introduces the first concept-aware benchmark for educational AI-generated video assessment and a S2D-MoE framework that jointly evaluates perceptual quality and fine-grained semantic alignment.

  9. One-to-All Animation: Alignment-Free Character Animation and Image Pose Transfer

    cs.CV 2025-11 unverdicted novelty 7.0

    One-to-All Animation enables alignment-free character animation and image pose transfer via self-supervised outpainting reformulation, reference extraction, hybrid fusion attention, identity-robust pose control, and t...

  10. History-Guided Video Diffusion

    cs.LG 2025-02 unverdicted novelty 7.0

    DFoT enables flexible history conditioning in video diffusion, with history guidance methods that boost temporal consistency and support long rollouts.

  11. Learning Interactive Real-World Simulators

    cs.AI 2023-10 conditional novelty 7.0

    UniSim learns a universal real-world simulator from orchestrated diverse datasets, enabling zero-shot deployment of policies trained purely in simulation.

  12. RoboEvolve: Co-Evolving Planner-Simulator for Robotic Manipulation with Limited Data

    cs.RO 2026-05 unverdicted novelty 6.0

    A co-evolutionary VLM-VGM loop on 500 unlabeled images raises planner success by 30 points and simulator success by 48 percent while beating fully supervised baselines.

  13. SWIFT: Prompt-Adaptive Memory for Efficient Interactive Long Video Generation

    cs.CV 2026-05 unverdicted novelty 6.0

    SWIFT introduces a semantic injection cache with head-wise updates and an adaptive dynamic window plus segment anchors to achieve efficient multi-prompt long video generation at 22.6 FPS while preserving quality in ca...

  14. DiffATS: Diffusion in Aligned Tensor Space

    cs.LG 2026-05 unverdicted novelty 6.0

    DiffATS trains diffusion models directly on aligned Tucker tensor primitives that are proven to be homeomorphisms, delivering efficient unconditional and conditional generation across images, videos, and PDE data with...

  15. Mutual Forcing: Dual-Mode Self-Evolution for Fast Autoregressive Audio-Video Character Generation

    cs.CV 2026-04 unverdicted novelty 6.0

    Mutual Forcing trains a single native autoregressive audio-video model with mutually reinforcing few-step and multi-step modes via self-distillation to match 50-step baselines at 4-8 steps.

  16. Ego-InBetween: Generating Object State Transitions in Ego-Centric Videos

    cs.CV 2026-04 unverdicted novelty 6.0

    EgoIn uses a fine-tuned vision-language model to infer transition steps and a conditioning module plus auxiliary supervision to generate coherent egocentric video sequences of object state changes.

  17. Rein3D: Reinforced 3D Indoor Scene Generation with Panoramic Video Diffusion Models

    cs.CV 2026-04 unverdicted novelty 6.0

    Rein3D generates photorealistic, globally consistent 3D indoor scenes by using a restore-and-refine process where radial panoramic videos are restored via diffusion models and then used to update a 3D Gaussian field.

  18. Latent-Compressed Variational Autoencoder for Video Diffusion Models

    cs.CV 2026-04 unverdicted novelty 6.0

    A frequency-based latent compression method for video VAEs yields higher reconstruction quality than channel-reduction baselines at fixed compression ratios.

  19. Causal Forcing: Autoregressive Diffusion Distillation Done Right for High-Quality Real-Time Interactive Video Generation

    cs.CV 2026-02 conditional novelty 6.0

    Causal Forcing initializes autoregressive diffusion students from AR teachers to recover flow maps that bidirectional teachers cannot provide, delivering 19%+ gains over Self Forcing on dynamic degree and related metrics.

  20. Causal Forcing: Autoregressive Diffusion Distillation Done Right for High-Quality Real-Time Interactive Video Generation

    cs.CV 2026-02 conditional novelty 6.0

    Causal Forcing uses an autoregressive teacher for ODE initialization in diffusion distillation to close the causal attention gap and deliver better real-time video generation than Self Forcing.

  21. Reward Forcing: Efficient Streaming Video Generation with Rewarded Distribution Matching Distillation

    cs.CV 2025-12 conditional novelty 6.0

    Reward Forcing combines EMA-Sink tokens and Rewarded Distribution Matching Distillation to deliver state-of-the-art streaming video generation at 23.1 FPS without copying initial frames.

  22. Self-Forcing++: Towards Minute-Scale High-Quality Video Generation

    cs.CV 2025-10 conditional novelty 6.0

    Self-Forcing++ scales autoregressive video diffusion to over 4 minutes by using self-generated segments for guidance, reducing error accumulation and outperforming baselines in fidelity and consistency.

  23. LongLive: Real-time Interactive Long Video Generation

    cs.CV 2025-09 conditional novelty 6.0

    LongLive is a causal autoregressive video generator that produces up to 240-second interactive videos at 20.7 FPS on one H100 GPU after 32 GPU-days of fine-tuning from a 1.3B short-clip model.

  24. ReSim: Reliable World Simulation for Autonomous Driving

    cs.CV 2025-06 unverdicted novelty 6.0

    ReSim is a controllable video world model trained on heterogeneous real and simulated driving data that achieves higher fidelity and controllability for both expert and non-expert actions, plus a Video2Reward module f...

  25. We'll Fix it in Post: Improving Text-to-Video Generation with Neuro-Symbolic Feedback

    cs.CV 2025-04 unverdicted novelty 6.0

    NeuS-E is a post-generation refinement method that uses neuro-symbolic analysis of a formal video representation to detect and correct semantic and temporal inconsistencies in text-to-video outputs, improving prompt a...

  26. VBench-2.0: Advancing Video Generation Benchmark Suite for Intrinsic Faithfulness

    cs.CV 2025-03 accept novelty 6.0

    VBench-2.0 is a benchmark suite that automatically evaluates video generative models on five dimensions of intrinsic faithfulness: Human Fidelity, Controllability, Creativity, Physics, and Commonsense using VLMs, LLMs...

  27. Long-Context Autoregressive Video Modeling with Next-Frame Prediction

    cs.CV 2025-03 unverdicted novelty 6.0

    FAR baseline plus asymmetric kernels for long short-term context modeling achieves SOTA short and long video generation in autoregressive setups.

  28. CameraCtrl: Enabling Camera Control for Text-to-Video Generation

    cs.CV 2024-04 unverdicted novelty 6.0

    CameraCtrl enables accurate camera pose control in video diffusion models through a trained plug-and-play module and dataset choices emphasizing diverse camera trajectories with matching appearance.

  29. Latte: Latent Diffusion Transformer for Video Generation

    cs.CV 2024-01 unverdicted novelty 6.0

    Latte achieves state-of-the-art video generation on FaceForensics, SkyTimelapse, UCF101, and Taichi-HD by using a latent diffusion transformer with four efficient spatial-temporal decomposition variants and best-pract...

  30. VideoPoet: A Large Language Model for Zero-Shot Video Generation

    cs.CV 2023-12 unverdicted novelty 6.0

    VideoPoet is a large language model that performs zero-shot video generation with audio from diverse multimodal conditioning signals.

  31. VideoCrafter1: Open Diffusion Models for High-Quality Video Generation

    cs.CV 2023-10 unverdicted novelty 6.0

    Open-source text-to-video and image-to-video diffusion models generate high-quality 1024x576 videos, with the I2V variant claimed as the first to strictly preserve reference image content.

  32. DragNUWA: Fine-grained Control in Video Generation by Integrating Text, Image, and Trajectory

    cs.CV 2023-08 unverdicted novelty 6.0

    DragNUWA integrates text, image, and trajectory controls into a diffusion video model using a Trajectory Sampler, Multiscale Fusion, and Adaptive Training to enable fine-grained open-domain video generation.

  33. SWoMo: Neuro-Symbolic World Model for Cataract Surgery Simulation

    cs.CV 2026-05 unverdicted novelty 5.0

    SWoMo decouples symbolic rule-based motion modeling from diffusion-based visual realism using inverse pairing of reconstructed real videos to enable sim-to-real translation and generalization in cataract surgery simulations.

  34. SWoMo: Neuro-Symbolic World Model for Cataract Surgery Simulation

    cs.CV 2026-05 conditional novelty 5.0

    SWoMo decouples symbolic rule-based motion modeling via scene graphs from visual realism via diffusion models, trained through inverse pairing of real cataract surgery videos reconstructed in the simulator for sim-to-...

  35. DepthPilot: From Controllability to Interpretability in Colonoscopy Video Generation

    cs.CV 2026-04 unverdicted novelty 5.0

    DepthPilot generates physically consistent and clinically interpretable colonoscopy videos by injecting depth priors into diffusion models through parameter-efficient fine-tuning and replacing linear denoising weights...

  36. Not all tokens contribute equally to diffusion learning

    cs.CV 2026-04 unverdicted novelty 5.0

    DARE mitigates neglect of important tokens in conditional diffusion models via distribution-rectified guidance and spatial attention alignment.

  37. DriVerse: Navigation World Model for Driving Simulation via Multimodal Trajectory Prompting and Motion Alignment

    cs.RO 2025-04 unverdicted novelty 5.0

    DriVerse is a generative model that simulates driving scenes from an image and trajectory using multimodal prompting and motion alignment, achieving better performance on nuScenes and Waymo datasets with minimal training.

  38. Empowering Video Translation using Multimodal Large Language Models

    cs.CV 2026-04 unverdicted novelty 4.0

    The paper offers the first focused review of MLLM-based video translation organized by a three-role taxonomy of Semantic Reasoner, Expressive Performer, and Visual Synthesizer, plus open challenges.

  39. Character-Centered Dialogue Generation from Scene-Level Prompts

    cs.CV 2025-05 unverdicted novelty 4.0

    A training-free framework generates expressive, character-grounded dialogue and speech from scene prompts using vision-language encoders, LLMs, and a recursive narrative memory bank for cross-scene consistency.

  40. Scene-Action Prompt Fusion for Coherent Text-to-Video Storytelling

    cs.CV 2025-03 unverdicted novelty 3.0

    A prompt fusion approach combines bidirectional time-weighted latent blending, dynamics-informed prompt weighting via CLIP, and semantic action representations to produce temporally consistent long videos from text wi...

  41. Evolution of Video Generative Foundations

    cs.CV 2026-04 unverdicted novelty 2.0

    This survey traces video generation technology from GANs to diffusion models and then to autoregressive and multimodal approaches while analyzing principles, strengths, and future trends.

Reference graph

Works this paper leans on

48 extracted references · 48 canonical work pages · cited by 39 Pith papers · 15 internal anchors

  1. [1]

    Large scale GAN training for high fidelity natural image synthesis

    Andrew Brock, Jeff Donahue, and Karen Simonyan. Large scale GAN training for high fidelity natural image synthesis. In ICLR, 2019. 1

  2. [2]

    and Karras, Tero , year =

    Tim Brooks, Janne Hellsten, Miika Aittala, Ting-Chun Wang, Timo Aila, Jaakko Lehtinen, Ming-Yu Liu, Alexei A Efros, and Tero Karras. Generating long videos of dynamic scenes. arXiv preprint arXiv:2206.03429, 2022. 1, 6

  3. [3]

    Hier- archical video generation for complex data

    Lluis Castrejon, Nicolas Ballas, and Aaron Courville. Hier- archical video generation for complex data. arXiv preprint arXiv:2106.02719, 2021. 5

  4. [4]

    Diffusion models beat gans on image synthesis

    Prafulla Dhariwal and Alexander Nichol. Diffusion models beat gans on image synthesis. Advances in Neural Informa- tion Processing Systems, 34:8780–8794, 2021. 1, 3, 5

  5. [5]

    Taming transformers for high-resolution image synthesis

    Patrick Esser, Robin Rombach, and Bjorn Ommer. Taming transformers for high-resolution image synthesis. In Pro- ceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 12873–12883, 2021. 1, 3, 5

  6. [6]

    2022 , journal =

    Songwei Ge, Thomas Hayes, Harry Yang, Xi Yin, Guan Pang, David Jacobs, Jia-Bin Huang, and Devi Parikh. Long video generation with time-agnostic vqgan and time- sensitive transformer. arXiv preprint arXiv:2204.03638 ,

  7. [7]

    Probabilistic video generation using holis- tic attribute control

    Jiawei He, Andreas Lehrmann, Joseph Marino, Greg Mori, and Leonid Sigal. Probabilistic video generation using holis- tic attribute control. In Proceedings of the European Confer- ence on Computer Vision (ECCV), pages 452–467, 2018. 1, 3

  8. [8]

    Imagen Video: High Definition Video Generation with Diffusion Models

    Jonathan Ho, William Chan, Chitwan Saharia, Jay Whang, Ruiqi Gao, Alexey Gritsenko, Diederik P Kingma, Ben Poole, Mohammad Norouzi, David J Fleet, et al. Imagen video: High definition video generation with diffusion mod- els. arXiv preprint arXiv:2210.02303, 2022. 1, 4

  9. [9]

    Denoising diffu- sion probabilistic models

    Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffu- sion probabilistic models. Advances in Neural Information Processing Systems, 33:6840–6851, 2020. 1, 3, 5

  10. [10]

    Cascaded diffusion models for high fidelity image generation

    Jonathan Ho, Chitwan Saharia, William Chan, David J Fleet, Mohammad Norouzi, and Tim Salimans. Cascaded diffusion models for high fidelity image generation. J. Mach. Learn. Res., 23:47–1, 2022. 3, 6

  11. [11]

    Classifier-Free Diffusion Guidance

    Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598, 2022. 6

  12. [12]

    Video Diffusion Models

    Jonathan Ho, Tim Salimans, Alexey Gritsenko, William Chan, Mohammad Norouzi, and David J Fleet. Video dif- fusion models. arXiv preprint arXiv:2204.03458, 2022. 1, 4, 5, 6, 7

  13. [13]

    CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers

    Wenyi Hong, Ming Ding, Wendi Zheng, Xinghan Liu, and Jie Tang. Cogvideo: Large-scale pretraining for text-to-video generation via transformers. arXiv preprint arXiv:2205.15868, 2022. 5

  14. [14]

    Alias-free generative adversarial networks

    Tero Karras, Miika Aittala, Samuli Laine, Erik H ¨ark¨onen, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. Alias-free generative adversarial networks. In NeurIPS, 2021. 1

  15. [15]

    A style-based generator architecture for generative adversarial networks

    Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. In CVPR, 2019. 1

  16. [16]

    Analyzing and improving the image quality of StyleGAN

    Tero Karras, Samuli Laine, Miika Aittala, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. Analyzing and improving the image quality of StyleGAN. In CVPR, 2020. 1

  17. [17]

    Melgan: Generative adversarial networks for conditional waveform synthesis.CVPR, 2019a

    Manoj Kumar, Mohammad Babaeizadeh, Dumitru Er- han, Chelsea Finn, Sergey Levine, Laurent Dinh, and Durk Kingma. Videoflow: A conditional flow-based model for stochastic video generation. arXiv preprint arXiv:1903.01434, 2019. 1, 3

  18. [18]

    GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models

    Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. Glide: Towards photorealistic image generation and editing with text-guided diffusion models.arXiv preprint arXiv:2112.10741, 2021. 1, 3

  19. [19]

    Improved denoising diffusion probabilistic models

    Alexander Quinn Nichol and Prafulla Dhariwal. Improved denoising diffusion probabilistic models. In International Conference on Machine Learning, pages 8162–8171. PMLR,

  20. [20]

    Neural Discrete Representation Learning

    Aaron van den Oord, Oriol Vinyals, and Koray Kavukcuoglu. Neural discrete representation learning. arXiv preprint arXiv:1711.00937, 2017. 3

  21. [21]

    Latent video transformer

    Ruslan Rakhimov, Denis V olkhonskiy, Alexey Artemov, De- nis Zorin, and Evgeny Burnaev. Latent video transformer. arXiv preprint arXiv:2006.10704, 2020. 1

  22. [22]

    Hierarchical Text-Conditional Image Generation with CLIP Latents

    Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image gen- eration with clip latents. arXiv preprint arXiv:2204.06125,

  23. [23]

    High-resolution image synthesis with latent diffusion models

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Bj ¨orn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10684–10695, 2022. 1, 3, 4

  24. [24]

    U- net: Convolutional networks for biomedical image segmen- tation

    Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U- net: Convolutional networks for biomedical image segmen- tation. In Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, October 5-9, 2015, Proceedings, Part III 18, pages 234–241. Springer, 2015. 5

  25. [25]

    Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding

    Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily Denton, Seyed Kamyar Seyed Ghasemipour, Burcu Karagol Ayan, S Sara Mahdavi, Rapha Gontijo Lopes, et al. Photorealistic text-to-image diffusion models with deep language understanding. arXiv preprint arXiv:2205.11487, 2022. 1, 3

  26. [26]

    Tempo- ral generative adversarial nets with singular value clipping

    Masaki Saito, Eiichi Matsumoto, and Shunta Saito. Tempo- ral generative adversarial nets with singular value clipping. In Proceedings of the IEEE international conference on com- puter vision, pages 2830–2839, 2017. 1, 3

  27. [27]

    Train sparsely, generate densely: Memory- efficient unsupervised training of high-resolution temporal gan

    Masaki Saito, Shunta Saito, Masanori Koyama, and So- suke Kobayashi. Train sparsely, generate densely: Memory- efficient unsupervised training of high-resolution temporal gan. International Journal of Computer Vision , 128:2586– 2606, 2020. 1, 3, 6, 7

  28. [28]

    First order motion model for image animation

    Aliaksandr Siarohin, St ´ephane Lathuili`ere, Sergey Tulyakov, Elisa Ricci, and Nicu Sebe. First order motion model for image animation. NeurIPS, 2019. 6

  29. [29]

    Make-A-Video: Text-to-Video Generation without Text-Video Data

    Uriel Singer, Adam Polyak, Thomas Hayes, Xi Yin, Jie An, Songyang Zhang, Qiyuan Hu, Harry Yang, Oron Ashual, Oran Gafni, et al. Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:2209.14792 ,

  30. [30]

    Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2

    Ivan Skorokhodov, Sergey Tulyakov, and Mohamed Elho- seiny. Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pat- tern Recognition, pages 3626–3636, 2022. 1, 3

  31. [31]

    Deep unsupervised learning using nonequilibrium thermodynamics

    Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In International Confer- ence on Machine Learning, pages 2256–2265. PMLR, 2015. 3

  32. [32]

    Denoising Diffusion Implicit Models

    Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502, 2020. 3

  33. [33]

    Score-Based Generative Modeling through Stochastic Differential Equations

    Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Ab- hishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equa- tions. arXiv preprint arXiv:2011.13456, 2020. 3

  34. [34]

    UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild

    Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. Ucf101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402, 2012. 3, 6

  35. [35]

    Metaxas, and Sergey Tulyakov

    Yu Tian, Jian Ren, Menglei Chai, Kyle Olszewski, Xi Peng, Dimitris N. Metaxas, and Sergey Tulyakov. A good image generator is what you need for high-resolution video synthe- sis. In International Conference on Learning Representa- tions, 2021. 1, 3, 6, 7

  36. [36]

    Mocogan: Decomposing motion and content for video generation

    Sergey Tulyakov, Ming-Yu Liu, Xiaodong Yang, and Jan Kautz. Mocogan: Decomposing motion and content for video generation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1526–1535,

  37. [37]

    Towards Accurate Generative Models of Video: A New Metric & Challenges

    Thomas Unterthiner, Sjoerd van Steenkiste, Karol Kurach, Raphael Marinier, Marcin Michalski, and Sylvain Gelly. To- wards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:1812.01717, 2018. 6

  38. [38]

    To- wards accurate generative models of video: A new metric & challenges

    Thomas Unterthiner, Sjoerd van Steenkiste, Karol Kurach, Raphael Marinier, Marcin Michalski, and Sylvain Gelly. To- wards accurate generative models of video: A new metric & challenges. ICLR, 2019. 6

  39. [39]

    Masked conditional video diffusion for prediction, generation, and interpolation.arXiv preprint arXiv:2205.09853,

    Vikram V oleti, Alexia Jolicoeur-Martineau, and Christopher Pal. Masked conditional video diffusion for prediction, gen- eration, and interpolation. arXiv preprint arXiv:2205.09853,

  40. [40]

    Generating videos with scene dynamics

    Carl V ondrick, Hamed Pirsiavash, and Antonio Torralba. Generating videos with scene dynamics. Advances in neu- ral information processing systems, 29, 2016. 1, 3

  41. [41]

    Predicting video with vqvae

    Jacob Walker, Ali Razavi, and A ¨aron van den Oord. Pre- dicting video with vqvae. arXiv preprint arXiv:2103.01950,

  42. [42]

    Scal- ing autoregressive video models.arXiv preprint arXiv:1906.02634,

    Dirk Weissenborn, Oscar T ¨ackstr¨om, and Jakob Uszkor- eit. Scaling autoregressive video models. arXiv preprint arXiv:1906.02634, 2019. 1

  43. [43]

    Learning to generate time-lapse videos using multi-stage dy- namic generative adversarial networks

    Wei Xiong, Wenhan Luo, Lin Ma, Wei Liu, and Jiebo Luo. Learning to generate time-lapse videos using multi-stage dy- namic generative adversarial networks. In The IEEE Confer- ence on Computer Vision and Pattern Recognition (CVPR) , June 2018. 3, 6

  44. [44]

    VideoGPT: Video Generation using VQ-VAE and Transformers

    Wilson Yan, Yunzhi Zhang, Pieter Abbeel, and Aravind Srinivas. Videogpt: Video generation using vq-vae and trans- formers. arXiv preprint arXiv:2104.10157, 2021. 1, 3

  45. [45]

    Video probabilistic diffusion models in projected latent space

    Sihyun Yu, Kihyuk Sohn, Subin Kim, and Jinwoo Shin. Video probabilistic diffusion models in projected latent space. arXiv preprint arXiv:2302.07685, 2023. 4

  46. [46]

    Generating videos with dynamics-aware implicit generative adversarial net- works

    Sihyun Yu, Jihoon Tack, Sangwoo Mo, Hyunsu Kim, Junho Kim, Jung-Woo Ha, and Jinwoo Shin. Generating videos with dynamics-aware implicit generative adversarial net- works. In International Conference on Learning Represen- tations, 2022. 3, 6, 7

  47. [47]

    The unreasonable effectiveness of deep features as a perceptual metric

    Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In CVPR, 2018. 5

  48. [48]

    MagicVideo: Efficient Video Generation With Latent Diffusion Models

    Daquan Zhou, Weimin Wang, Hanshu Yan, Weiwei Lv, Yizhe Zhu, and Jiashi Feng. Magicvideo: Efficient video generation with latent diffusion models. arXiv preprint arXiv:2211.11018, 2022. 1, 4