arxiv: 2211.13221 · v2 · submitted 2022-11-23 · 💻 cs.CV · cs.AI

Latent Video Diffusion Models for High-Fidelity Long Video Generation

Yingqing He, Tianyu Yang, Yong Zhang, Ying Shan, Qifeng Chen

Pith reviewed 2026-05-15 04:23 UTC · model grok-4.3

classification 💻 cs.CV cs.AI

keywords video diffusion · latent space · long video generation · hierarchical diffusion · text-to-video · 3D latent · conditional perturbation

The pith

Video diffusion models shift to a low-dimensional 3D latent space to generate realistic clips longer than 1000 frames with modest compute.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that shifting the diffusion process for video into a compressed three-dimensional latent space produces better-looking results than working directly on pixels while using far less computation. A hierarchical scheme in that space then allows the model to build videos longer than one thousand frames by generating them in stages. To stop quality from dropping as the sequence grows, the authors insert controlled noise into the latent representations and apply an unconditional guidance step that corrects accumulated mistakes. Tests on small specialized datasets confirm longer and more realistic output than earlier methods, with an additional demonstration on large-scale text-conditioned generation. Readers would care because practical video synthesis has been blocked by either short length or high hardware demands.

Core claim

We introduce lightweight video diffusion models by leveraging a low-dimensional 3D latent space, which significantly outperforms previous pixel-space video diffusion models under a limited computational budget. We propose hierarchical diffusion in the latent space to produce longer videos with more than one thousand frames. Conditional latent perturbation and unconditional guidance are added to mitigate accumulated errors during video length extension.
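The hierarchical idea in the core claim can be pictured as a two-stage frame plan: sparse keyframes first, then the frames between each adjacent keyframe pair. The sketch below only does the index bookkeeping. The function name `plan_hierarchy` and the stride of 4 are illustrative assumptions, not values or code from the paper.

```python
def plan_hierarchy(num_frames, stride=4):
    """Return (keyframes, segments) for a two-stage generation plan.

    keyframes: sparse frame indices produced in the first stage.
    segments:  (start, end, between) triples, where `between` lists the
               indices to fill in during the second stage.
    The stride is a hypothetical choice for illustration only.
    """
    if num_frames < 2 or stride < 1:
        raise ValueError("need num_frames >= 2 and stride >= 1")
    keyframes = list(range(0, num_frames, stride))
    # Make sure the final frame is always a keyframe so every gap is bounded.
    if keyframes[-1] != num_frames - 1:
        keyframes.append(num_frames - 1)
    segments = []
    for start, end in zip(keyframes, keyframes[1:]):
        segments.append((start, end, list(range(start + 1, end))))
    return keyframes, segments


keyframes, segments = plan_hierarchy(1024, stride=4)
# Every frame index should be produced exactly once across both stages.
covered = sorted(set(keyframes) | {i for _, _, mid in segments for i in mid})
```

In this toy plan the first stage only has to produce about a quarter of the frames, which is one way to see why staged generation keeps the sequence seen by any single model short.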

What carries the argument

Low-dimensional 3D latent space for the diffusion process, together with hierarchical diffusion, conditional latent perturbation, and unconditional guidance.

If this is right

Videos exceeding 1000 frames become feasible without proportional growth in required computation.
Output realism exceeds that of prior pixel-space diffusion models when compute is constrained.
Conditional latent perturbation and unconditional guidance reduce error buildup over extended sequences.
The framework scales to large-scale text-to-video tasks while preserving the efficiency gains.
Results hold across small domain-specific datasets of varied categories.
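The two error-mitigation mechanisms named above can be sketched in a few lines. This is a minimal sketch under stated assumptions: the perturbation is modelled as Gaussian noising of the conditioning latents to a chosen noise level, and the guidance rule follows the classifier-free guidance form, which the paper may implement differently. The names `perturb_condition` and `guided_eps`, the noise level 0.9, and the weight `w=0.5` are hypothetical.

```python
import math
import random


def perturb_condition(z_cond, alpha_bar_s, rng):
    """Noise conditioning latents to level alpha_bar_s (1.0 means no noise).

    Assumption: same Gaussian form as the diffusion forward process.
    """
    scale = math.sqrt(alpha_bar_s)
    sigma = math.sqrt(1.0 - alpha_bar_s)
    return [scale * z + sigma * rng.gauss(0.0, 1.0) for z in z_cond]


def guided_eps(eps_cond, eps_uncond, w):
    """Blend conditional and unconditional noise predictions.

    Classifier-free-guidance-style rule; w = 0 returns the conditional
    prediction unchanged.
    """
    return [(1.0 + w) * c - w * u for c, u in zip(eps_cond, eps_uncond)]


rng = random.Random(0)
z_cond = [rng.gauss(0.0, 1.0) for _ in range(16)]   # stand-in conditioning latents
z_noisy = perturb_condition(z_cond, alpha_bar_s=0.9, rng=rng)

eps_c = [rng.gauss(0.0, 1.0) for _ in range(16)]    # stand-in conditional prediction
eps_u = [rng.gauss(0.0, 1.0) for _ in range(16)]    # stand-in unconditional prediction
eps_guided = guided_eps(eps_c, eps_u, w=0.5)
```

Training on perturbed conditions is meant to make the model tolerant of the imperfect frames it will be conditioned on at sampling time; the guidance weight then trades adherence to those frames against the unconditional prior.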

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same latent compression and hierarchy might enable real-time or on-device video synthesis on consumer hardware.
Hierarchical latent diffusion could transfer to related tasks such as long audio generation or sequential image synthesis.
Future checks could verify whether fine motion details survive repeated latent compression and extension steps.
Pairing the approach with existing video codecs might push feasible sequence lengths even further.

Load-bearing premise

The compressed 3D latent space retains enough spatial-temporal information to allow high-fidelity video generation without irreversible detail loss.
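To make the premise concrete, the arithmetic below shows how much smaller the latent tensor is than the pixel tensor. The spatial and temporal downsampling factors of 8 and 4 come from the paper passage quoted later on this page; the clip size (16 frames at 256×256) and the latent channel count of 4 are assumptions chosen only for illustration.

```python
def latent_shape(frames, height, width, f_s=8, f_t=4, latent_channels=4):
    """Latent (T, H, W, C) for spatial factor f_s and temporal factor f_t.

    latent_channels=4 is an assumed value, not taken from the paper.
    """
    if frames % f_t or height % f_s or width % f_s:
        raise ValueError("clip dimensions must be divisible by the factors")
    return (frames // f_t, height // f_s, width // f_s, latent_channels)


def compression_ratio(frames, height, width, pixel_channels=3,
                      f_s=8, f_t=4, latent_channels=4):
    """Ratio of pixel-space element count to latent-space element count."""
    t, h, w, c = latent_shape(frames, height, width, f_s, f_t, latent_channels)
    return (frames * height * width * pixel_channels) / (t * h * w * c)


shape = latent_shape(16, 256, 256)        # (4, 32, 32, 4)
ratio = compression_ratio(16, 256, 256)   # 192.0
```

Under these assumptions the diffusion model sees 192 times fewer values per clip, which is where the efficiency comes from and also why the premise about retained detail is load-bearing.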

What would settle it

Train the model on a held-out dataset, generate sequences exceeding 1000 frames, and measure whether visual artifacts or temporal inconsistencies appear that are absent in equivalent pixel-space diffusion runs at higher compute cost.

Original abstract

AI-generated content has attracted lots of attention recently, but photo-realistic video synthesis is still challenging. Although many attempts using GANs and autoregressive models have been made in this area, the visual quality and length of generated videos are far from satisfactory. Diffusion models have shown remarkable results recently but require significant computational resources. To address this, we introduce lightweight video diffusion models by leveraging a low-dimensional 3D latent space, significantly outperforming previous pixel-space video diffusion models under a limited computational budget. In addition, we propose hierarchical diffusion in the latent space such that longer videos with more than one thousand frames can be produced. To further overcome the performance degradation issue for long video generation, we propose conditional latent perturbation and unconditional guidance that effectively mitigate the accumulated errors during the extension of video length. Extensive experiments on small domain datasets of different categories suggest that our framework generates more realistic and longer videos than previous strong baselines. We additionally provide an extension to large-scale text-to-video generation to demonstrate the superiority of our work. Our code and models will be made publicly available.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

The paper shows a practical route to longer video diffusion by compressing video into 3D latents and combining hierarchical scheduling with conditional latent perturbation and unconditional guidance, but the gains rest on qualitative claims without numbers or ablations.

The core advance is moving diffusion from pixels to a low-dimensional 3D latent space produced by a video autoencoder, then applying hierarchical diffusion to reach over 1000 frames while adding conditional latent perturbation and unconditional guidance to limit error accumulation. This setup is presented as running under tighter compute budgets than prior pixel-space video diffusion work and is tested on small-domain datasets plus a text-to-video extension. The recipe is concrete and the components are described clearly enough that someone could reimplement the pipeline from the text. The hierarchical scheduling and the two mitigation steps are sensible engineering moves that directly target the length problem without requiring a full redesign of the diffusion process. Those pieces give the work its incremental value over earlier diffusion papers that stayed in pixel space.

The main weakness is the evidence base. The abstract and claims rely on qualitative comparisons and visual examples on narrow datasets, with no reported quantitative metrics, error bars, ablation tables, or reconstruction PSNR/SSIM numbers for the autoencoder itself. That leaves the central assumption, that the compressed latent space preserves enough spatial-temporal detail for high-fidelity output, unverified in the supplied material. If the latent compression drops high-frequency motion or texture, the later guidance steps cannot recover it. The experiments stay on small domains, so generalization to open-world or high-resolution cases is not shown.

This paper is useful for groups already building video diffusion systems who need a working template for length extension under compute limits. It is not a foundational theoretical result, but the method is coherent and addresses a real practical bottleneck. A serious editor should send it to referees so the authors can add the missing metrics and ablations; the current version is too thin for acceptance but worth the review cycle.

Referee Report

3 major / 2 minor

Summary. The paper proposes latent video diffusion models operating in a low-dimensional 3D latent space to enable lightweight, high-fidelity video generation that outperforms pixel-space baselines under limited compute. It introduces hierarchical diffusion to produce videos exceeding 1000 frames and conditional latent perturbation plus unconditional guidance to mitigate error accumulation during length extension. Claims are supported by qualitative results on small-domain datasets across categories plus a text-to-video extension.

Significance. If the central claims hold under rigorous evaluation, the work would advance efficient generative video modeling by showing how latent-space diffusion can reduce computational cost while scaling to long sequences, addressing key bottlenecks in current video diffusion approaches.

major comments (3)

[Experiments] Experiments section: the central claim of outperforming prior pixel-space video diffusion models rests on qualitative comparisons and 'extensive experiments' on small-domain datasets, but the manuscript supplies no quantitative metrics (e.g., FVD, FID, PSNR), error bars, ablation tables, or explicit baseline specifications, leaving the outperformance assertion only partially supported.
[§3.1] §3.1 (Video Autoencoder and latent space): the low-dimensional 3D latent representation is load-bearing for both efficiency and fidelity claims, yet no reconstruction metrics, latent-dimension ablations, or spatio-temporal detail preservation analysis are reported; without these, it is unclear whether critical high-frequency or temporal information is retained.
[§4.3] §4.3 (Long-video extension): conditional latent perturbation and unconditional guidance are presented as solutions to accumulated errors, but the section provides no quantitative tracking of error growth, ablation isolating each component, or metrics comparing guided vs. unguided long sequences, weakening the mitigation claim.
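The reconstruction check requested in the §3.1 comment could start from something as simple as per-clip PSNR between an input clip and its autoencoder reconstruction. The sketch below uses the standard PSNR definition on flattened pixel values in [0, 1]; the sample values are made up for illustration and are not results from the paper.

```python
import math


def psnr(reference, reconstruction, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two equal-length sequences."""
    if not reference or len(reference) != len(reconstruction):
        raise ValueError("inputs must be non-empty and the same length")
    mse = sum((a - b) ** 2 for a, b in zip(reference, reconstruction)) / len(reference)
    if mse == 0.0:
        return math.inf  # identical signals
    return 10.0 * math.log10(max_value ** 2 / mse)


# Toy data: every reconstructed value is off by 0.1, so MSE is about 0.01.
reference = [0.0, 0.5, 1.0, 0.25]
reconstruction = [0.1, 0.4, 0.9, 0.35]
score = psnr(reference, reconstruction)   # about 20 dB
```

Reporting this per frame, rather than only as a clip average, would also show whether reconstruction quality varies along the temporal axis.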

minor comments (2)

[§3.1] Clarify the exact architecture and training details of the 3D autoencoder (e.g., compression ratio, loss terms) in the main text rather than deferring entirely to supplementary material.
[Figures] Figure captions and legends should explicitly state dataset, resolution, and number of frames for each qualitative example to aid reproducibility.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their thorough and constructive review. We address each major comment point-by-point below, clarifying our current results and outlining specific revisions that will strengthen the quantitative support for our claims.

Referee: [Experiments] Experiments section: the central claim of outperforming prior pixel-space video diffusion models rests on qualitative comparisons and 'extensive experiments' on small-domain datasets, but the manuscript supplies no quantitative metrics (e.g., FVD, FID, PSNR), error bars, ablation tables, or explicit baseline specifications, leaving the outperformance assertion only partially supported.

Authors: We agree that quantitative metrics would provide stronger evidence. In the revised manuscript we will add FVD and FID scores computed on the generated videos, include error bars from multiple random seeds, provide explicit ablation tables, and clearly document the baseline implementations together with their compute budgets to enable direct comparison. revision: yes
Referee: [§3.1] §3.1 (Video Autoencoder and latent space): the low-dimensional 3D latent representation is load-bearing for both efficiency and fidelity claims, yet no reconstruction metrics, latent-dimension ablations, or spatio-temporal detail preservation analysis are reported; without these, it is unclear whether critical high-frequency or temporal information is retained.

Authors: We acknowledge that additional validation of the latent space is warranted. The revised manuscript will report reconstruction metrics (PSNR, SSIM) for the 3D video autoencoder, include ablations across latent dimensions, and provide both quantitative and qualitative analysis confirming preservation of high-frequency spatial and temporal details. revision: yes
Referee: [§4.3] §4.3 (Long-video extension): conditional latent perturbation and unconditional guidance are presented as solutions to accumulated errors, but the section provides no quantitative tracking of error growth, ablation isolating each component, or metrics comparing guided vs. unguided long sequences, weakening the mitigation claim.

Authors: We agree that quantitative evidence for the error-mitigation techniques would strengthen the section. We will add plots tracking error growth over video length, ablations that isolate conditional latent perturbation and unconditional guidance, and direct metric comparisons between guided and unguided long-sequence generation. revision: yes

Circularity Check

0 steps flagged

Derivation chain is self-contained with no reductions to fitted inputs or self-citations

The paper extends standard diffusion models by introducing a low-dimensional 3D latent space via a video autoencoder, hierarchical diffusion for long sequences, and conditional latent perturbation plus unconditional guidance to mitigate error accumulation. These components are described as new additions with explicit training and sampling procedures. No equations in the provided abstract or description reduce performance claims to quantities defined solely by parameters fitted inside the paper, nor do any load-bearing steps rely on self-citations that themselves reduce to unverified assumptions. The central claims rest on standard diffusion mechanics plus independently motivated architectural extensions, making the derivation self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the domain assumption that a learned low-dimensional 3D latent space retains enough information for high-fidelity video reconstruction and that standard diffusion assumptions (Gaussian forward process, learned reverse process) transfer directly to this compressed space.
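For reference, the "standard diffusion assumptions" mentioned here are usually written as follows when the process runs on latents. This is the textbook DDPM forward process with the latent $z_0$ standing in for the data; the encoder symbol $\mathcal{E}$ and the schedule $\beta_s$ are notation assumed here rather than quoted from the paper.

```latex
z_0 = \mathcal{E}(x_0), \qquad
q(z_t \mid z_0) = \mathcal{N}\!\left(z_t;\ \sqrt{\bar\alpha_t}\, z_0,\ (1 - \bar\alpha_t)\, I\right), \qquad
\bar\alpha_t = \prod_{s=1}^{t} (1 - \beta_s)

% Equivalent sampling form, with \epsilon \sim \mathcal{N}(0, I):
z_t = \sqrt{\bar\alpha_t}\, z_0 + \sqrt{1 - \bar\alpha_t}\, \epsilon
```

The learned reverse process then predicts $\epsilon$ from $(z_t, t)$, so the only change from pixel-space diffusion is the space in which $z_0$ lives.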

axioms (2)

domain assumption: Video data can be compressed into a low-dimensional 3D latent space with little enough loss that high-fidelity reconstruction is still possible after diffusion sampling.
Invoked in the abstract as the basis for the lightweight model.
domain assumption: Hierarchical diffusion in latent space plus the two proposed correction mechanisms can limit error accumulation over sequences longer than 1000 frames.
Central to the long-video claim but presented without derivation or proof in the abstract.

pith-pipeline@v0.9.0 · 5488 in / 1405 out tokens · 77998 ms · 2026-05-15T04:23:42.819005+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

unclear
Relation between the paper passage and the cited Recognition theorem.

We compress videos using a lightweight 3D autoencoder... spatial and temporal downsampling factors of 8 and 4... hierarchical latent video diffusion models... conditional latent perturbation and unconditional guidance
IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

unclear
Relation between the paper passage and the cited Recognition theorem.

We propose to perform diffusion and denoising on the video latent space... $L_{\text{simple}}(\theta) := \lVert \epsilon_\theta(z_t, t) - \epsilon \rVert_2^2$
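The noise-prediction objective in the quoted passage can be illustrated with a small, dependency-free sketch: noise a latent to step t, then score a predicted noise vector against the true one. The linear beta schedule, its endpoints, the 1000-step horizon, and the function names are assumptions for illustration; no trained network is involved, so the "predictions" are stand-ins.

```python
import math
import random


def alpha_bar(t, num_steps=1000, beta_start=1e-4, beta_end=2e-2):
    """Cumulative product of (1 - beta_s), s = 1..t, for a linear schedule.

    The schedule and its endpoints are assumed values, not the paper's.
    """
    prod = 1.0
    for s in range(1, t + 1):
        beta = beta_start + (beta_end - beta_start) * (s - 1) / (num_steps - 1)
        prod *= 1.0 - beta
    return prod


def add_noise(z0, eps, t):
    """Forward process: z_t = sqrt(alpha_bar_t) * z0 + sqrt(1 - alpha_bar_t) * eps."""
    a = alpha_bar(t)
    return [math.sqrt(a) * z + math.sqrt(1.0 - a) * e for z, e in zip(z0, eps)]


def l_simple(eps_pred, eps):
    """Squared L2 distance between predicted and true noise."""
    return sum((p - e) ** 2 for p, e in zip(eps_pred, eps))


rng = random.Random(0)
z0 = [rng.gauss(0.0, 1.0) for _ in range(8)]    # stand-in for an encoded clip
eps = [rng.gauss(0.0, 1.0) for _ in range(8)]   # true noise
z_t = add_noise(z0, eps, t=500)                 # what the denoiser would see

loss_perfect = l_simple(eps, eps)               # a perfect predictor scores 0
loss_zero = l_simple([0.0] * len(eps), eps)     # predicting zeros scores ||eps||^2
```

In training, `eps_pred` would come from the denoising network evaluated at `(z_t, t)`, and the loss would be averaged over random clips, noise draws, and timesteps.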

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 22 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

TeDiO: Temporal Diagonal Optimization for Training-Free Coherent Video Diffusion
cs.CV 2026-05 unverdicted novelty 7.0

TeDiO regularizes temporal diagonals in diffusion transformer attention maps to produce smoother video motion while keeping per-frame quality intact.
GaitProtector: Impersonation-Driven Gait De-Identification via Training-Free Diffusion Latent Optimization
cs.CV 2026-05 unverdicted novelty 7.0

GaitProtector optimizes diffusion model latents to impersonate target identities in gait sequences, dropping Rank-1 identification accuracy from 89.6% to 15.0% on CASIA-B while keeping scoliosis diagnostic accuracy at 74.2%.
DCR: Counterfactual Attractor Guidance for Rare Compositional Generation
cs.CV 2026-05 unverdicted novelty 7.0

DCR uses a counterfactual attractor and projection-based repulsion to suppress default completion bias in diffusion models, improving fidelity for rare compositional prompts while preserving quality.
Efficient Video Diffusion Models: Advancements and Challenges
cs.CV 2026-04 unverdicted novelty 7.0

A survey that groups efficient video diffusion methods into four paradigms—step distillation, efficient attention, model compression, and cache/trajectory optimization—and outlines open challenges for practical use.
Not All Frames Deserve Full Computation: Accelerating Autoregressive Video Generation via Selective Computation and Predictive Extrapolation
cs.CV 2026-04 conditional novelty 7.0

SCOPE accelerates autoregressive video diffusion up to 4.73x by using a tri-modal cache-predict-recompute scheduler with Taylor extrapolation and selective active-frame computation while preserving output quality.
ChopGrad: Pixel-Wise Losses for Latent Video Diffusion via Truncated Backpropagation
cs.CV 2026-03 unverdicted novelty 7.0

ChopGrad truncates backpropagation to local frame windows in video diffusion models, reducing memory from linear in frame count to constant while enabling pixel-wise loss fine-tuning.
RoboEvolve: Co-Evolving Planner-Simulator for Robotic Manipulation with Limited Data
cs.RO 2026-05 unverdicted novelty 6.0

A co-evolutionary VLM-VGM loop on 500 unlabeled images raises planner success by 30 points and simulator success by 48 percent while beating fully supervised baselines.
SWIFT: Prompt-Adaptive Memory for Efficient Interactive Long Video Generation
cs.CV 2026-05 unverdicted novelty 6.0

SWIFT introduces a semantic injection cache with head-wise updates and an adaptive dynamic window plus segment anchors to achieve efficient multi-prompt long video generation at 22.6 FPS while preserving quality in ca...
DiffATS: Diffusion in Aligned Tensor Space
cs.LG 2026-05 unverdicted novelty 6.0

DiffATS trains diffusion models directly on aligned Tucker tensor primitives that are proven to be homeomorphisms, delivering efficient unconditional and conditional generation across images, videos, and PDE data with...
Mutual Forcing: Dual-Mode Self-Evolution for Fast Autoregressive Audio-Video Character Generation
cs.CV 2026-04 unverdicted novelty 6.0

Mutual Forcing trains a single native autoregressive audio-video model with mutually reinforcing few-step and multi-step modes via self-distillation to match 50-step baselines at 4-8 steps.
Ego-InBetween: Generating Object State Transitions in Ego-Centric Videos
cs.CV 2026-04 unverdicted novelty 6.0

EgoIn uses a fine-tuned vision-language model to infer transition steps and a conditioning module plus auxiliary supervision to generate coherent egocentric video sequences of object state changes.
Rein3D: Reinforced 3D Indoor Scene Generation with Panoramic Video Diffusion Models
cs.CV 2026-04 unverdicted novelty 6.0

Rein3D generates photorealistic, globally consistent 3D indoor scenes by using a restore-and-refine process where radial panoramic videos are restored via diffusion models and then used to update a 3D Gaussian field.
Latent-Compressed Variational Autoencoder for Video Diffusion Models
cs.CV 2026-04 unverdicted novelty 6.0

A frequency-based latent compression method for video VAEs yields higher reconstruction quality than channel-reduction baselines at fixed compression ratios.
LongLive: Real-time Interactive Long Video Generation
cs.CV 2025-09 conditional novelty 6.0

LongLive is a causal autoregressive video generator that produces up to 240-second interactive videos at 20.7 FPS on one H100 GPU after 32 GPU-days of fine-tuning from a 1.3B short-clip model.
VBench-2.0: Advancing Video Generation Benchmark Suite for Intrinsic Faithfulness
cs.CV 2025-03 accept novelty 6.0

VBench-2.0 is a benchmark suite that automatically evaluates video generative models on five dimensions of intrinsic faithfulness: Human Fidelity, Controllability, Creativity, Physics, and Commonsense using VLMs, LLMs...
CameraCtrl: Enabling Camera Control for Text-to-Video Generation
cs.CV 2024-04 unverdicted novelty 6.0

CameraCtrl enables accurate camera pose control in video diffusion models through a trained plug-and-play module and dataset choices emphasizing diverse camera trajectories with matching appearance.
Latte: Latent Diffusion Transformer for Video Generation
cs.CV 2024-01 unverdicted novelty 6.0

Latte achieves state-of-the-art video generation on FaceForensics, SkyTimelapse, UCF101, and Taichi-HD by using a latent diffusion transformer with four efficient spatial-temporal decomposition variants and best-pract...
VideoCrafter1: Open Diffusion Models for High-Quality Video Generation
cs.CV 2023-10 unverdicted novelty 6.0

Open-source text-to-video and image-to-video diffusion models generate high-quality 1024x576 videos, with the I2V variant claimed as the first to strictly preserve reference image content.
DepthPilot: From Controllability to Interpretability in Colonoscopy Video Generation
cs.CV 2026-04 unverdicted novelty 5.0

DepthPilot generates physically consistent and clinically interpretable colonoscopy videos by injecting depth priors into diffusion models through parameter-efficient fine-tuning and replacing linear denoising weights...
Not all tokens contribute equally to diffusion learning
cs.CV 2026-04 unverdicted novelty 5.0

DARE mitigates neglect of important tokens in conditional diffusion models via distribution-rectified guidance and spatial attention alignment.
Empowering Video Translation using Multimodal Large Language Models
cs.CV 2026-04 unverdicted novelty 4.0

The paper offers the first focused review of MLLM-based video translation organized by a three-role taxonomy of Semantic Reasoner, Expressive Performer, and Visual Synthesizer, plus open challenges.
Evolution of Video Generative Foundations
cs.CV 2026-04 unverdicted novelty 2.0

This survey traces video generation technology from GANs to diffusion models and then to autoregressive and multimodal approaches while analyzing principles, strengths, and future trends.

Reference graph

Works this paper leans on

48 extracted references · 48 canonical work pages · cited by 22 Pith papers · 14 internal anchors

[1]

Large scale GAN training for high ﬁdelity natural image synthesis

Andrew Brock, Jeff Donahue, and Karen Simonyan. Large scale GAN training for high ﬁdelity natural image synthesis. In ICLR, 2019. 1

work page 2019
[2]

Generating long videos of dynamic scenes

Tim Brooks, Janne Hellsten, Miika Aittala, Ting-Chun Wang, Timo Aila, Jaakko Lehtinen, Ming-Yu Liu, Alexei A Efros, and Tero Karras. Generating long videos of dynamic scenes. arXiv preprint arXiv:2206.03429, 2022. 1, 6

work page arXiv 2022
[3]

Hier- archical video generation for complex data

Lluis Castrejon, Nicolas Ballas, and Aaron Courville. Hier- archical video generation for complex data. arXiv preprint arXiv:2106.02719, 2021. 5

work page arXiv 2021
[4]

Diffusion models beat gans on image synthesis

Prafulla Dhariwal and Alexander Nichol. Diffusion models beat gans on image synthesis. Advances in Neural Informa- tion Processing Systems, 34:8780–8794, 2021. 1, 3, 5

work page 2021
[5]

Taming transformers for high-resolution image synthesis

Patrick Esser, Robin Rombach, and Bjorn Ommer. Taming transformers for high-resolution image synthesis. In Pro- ceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 12873–12883, 2021. 1, 3, 5

work page 2021
[6]

Long video generation with time-agnostic vqgan and time- sensitive transformer

Songwei Ge, Thomas Hayes, Harry Yang, Xi Yin, Guan Pang, David Jacobs, Jia-Bin Huang, and Devi Parikh. Long video generation with time-agnostic vqgan and time- sensitive transformer. arXiv preprint arXiv:2204.03638 ,

work page arXiv
[7]

Probabilistic video generation using holis- tic attribute control

Jiawei He, Andreas Lehrmann, Joseph Marino, Greg Mori, and Leonid Sigal. Probabilistic video generation using holis- tic attribute control. In Proceedings of the European Confer- ence on Computer Vision (ECCV), pages 452–467, 2018. 1, 3

work page 2018
[8]

Imagen Video: High Definition Video Generation with Diffusion Models

Jonathan Ho, William Chan, Chitwan Saharia, Jay Whang, Ruiqi Gao, Alexey Gritsenko, Diederik P Kingma, Ben Poole, Mohammad Norouzi, David J Fleet, et al. Imagen video: High deﬁnition video generation with diffusion mod- els. arXiv preprint arXiv:2210.02303, 2022. 1, 4

work page internal anchor Pith review Pith/arXiv arXiv 2022
[9]

Denoising diffu- sion probabilistic models

Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffu- sion probabilistic models. Advances in Neural Information Processing Systems, 33:6840–6851, 2020. 1, 3, 5

work page 2020
[10]

Cascaded diffusion models for high ﬁdelity image generation

Jonathan Ho, Chitwan Saharia, William Chan, David J Fleet, Mohammad Norouzi, and Tim Salimans. Cascaded diffusion models for high ﬁdelity image generation. J. Mach. Learn. Res., 23:47–1, 2022. 3, 6

work page 2022
[11]

Classifier-Free Diffusion Guidance

Jonathan Ho and Tim Salimans. Classiﬁer-free diffusion guidance. arXiv preprint arXiv:2207.12598, 2022. 6

work page internal anchor Pith review Pith/arXiv arXiv 2022
[12]

Video Diffusion Models

Jonathan Ho, Tim Salimans, Alexey Gritsenko, William Chan, Mohammad Norouzi, and David J Fleet. Video dif- fusion models. arXiv preprint arXiv:2204.03458, 2022. 1, 4, 5, 6, 7

work page internal anchor Pith review Pith/arXiv arXiv 2022
[13]

CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers

Wenyi Hong, Ming Ding, Wendi Zheng, Xinghan Liu, and Jie Tang. Cogvideo: Large-scale pretraining for text-to-video generation via transformers. arXiv preprint arXiv:2205.15868, 2022. 5

work page internal anchor Pith review Pith/arXiv arXiv 2022
[14]

Alias-free generative adversarial networks

Tero Karras, Miika Aittala, Samuli Laine, Erik H ¨ark¨onen, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. Alias-free generative adversarial networks. In NeurIPS, 2021. 1

work page 2021
[15]

A style-based generator architecture for generative adversarial networks

Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. In CVPR, 2019. 1

work page 2019
[16]

Analyzing and improving the image quality of StyleGAN

Tero Karras, Samuli Laine, Miika Aittala, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. Analyzing and improving the image quality of StyleGAN. In CVPR, 2020. 1

work page 2020
[17]

Videoﬂow: A conditional ﬂow-based model for stochastic video generation

Manoj Kumar, Mohammad Babaeizadeh, Dumitru Er- han, Chelsea Finn, Sergey Levine, Laurent Dinh, and Durk Kingma. Videoﬂow: A conditional ﬂow-based model for stochastic video generation. arXiv preprint arXiv:1903.01434, 2019. 1, 3

work page arXiv 1903
[18]

GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models

Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. Glide: Towards photorealistic image generation and editing with text-guided diffusion models.arXiv preprint arXiv:2112.10741, 2021. 1, 3

work page internal anchor Pith review Pith/arXiv arXiv 2021
[19]

Improved denoising diffusion probabilistic models

Alexander Quinn Nichol and Prafulla Dhariwal. Improved denoising diffusion probabilistic models. In International Conference on Machine Learning, pages 8162–8171. PMLR,

work page
[20]

Neural Discrete Representation Learning

Aaron van den Oord, Oriol Vinyals, and Koray Kavukcuoglu. Neural discrete representation learning. arXiv preprint arXiv:1711.00937, 2017. 3

work page internal anchor Pith review Pith/arXiv arXiv 2017
[21]

Latent video transformer

Ruslan Rakhimov, Denis V olkhonskiy, Alexey Artemov, De- nis Zorin, and Evgeny Burnaev. Latent video transformer. arXiv preprint arXiv:2006.10704, 2020. 1

work page arXiv 2006
[22]

Hierarchical Text-Conditional Image Generation with CLIP Latents

Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image gen- eration with clip latents. arXiv preprint arXiv:2204.06125,

work page internal anchor Pith review Pith/arXiv arXiv
[23]

High-resolution image synthesis with latent diffusion models

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Bj ¨orn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10684–10695, 2022. 1, 3, 4

work page 2022
[24]

U- net: Convolutional networks for biomedical image segmen- tation

Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U- net: Convolutional networks for biomedical image segmen- tation. In Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, October 5-9, 2015, Proceedings, Part III 18, pages 234–241. Springer, 2015. 5

work page 2015
[25]

Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding

Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily Denton, Seyed Kamyar Seyed Ghasemipour, Burcu Karagol Ayan, S Sara Mahdavi, Rapha Gontijo Lopes, et al. Photorealistic text-to-image diffusion models with deep language understanding. arXiv preprint arXiv:2205.11487, 2022. 1, 3

work page internal anchor Pith review Pith/arXiv arXiv 2022
[26]

Tempo- ral generative adversarial nets with singular value clipping

Masaki Saito, Eiichi Matsumoto, and Shunta Saito. Tempo- ral generative adversarial nets with singular value clipping. In Proceedings of the IEEE international conference on com- puter vision, pages 2830–2839, 2017. 1, 3

work page 2017
[27]

Train sparsely, generate densely: Memory- efﬁcient unsupervised training of high-resolution temporal gan

Masaki Saito, Shunta Saito, Masanori Koyama, and So- suke Kobayashi. Train sparsely, generate densely: Memory- efﬁcient unsupervised training of high-resolution temporal gan. International Journal of Computer Vision , 128:2586– 2606, 2020. 1, 3, 6, 7

work page 2020
[28]

First order motion model for image animation

Aliaksandr Siarohin, St ´ephane Lathuili`ere, Sergey Tulyakov, Elisa Ricci, and Nicu Sebe. First order motion model for image animation. NeurIPS, 2019. 6

work page 2019
[29]

Make-A-Video: Text-to-Video Generation without Text-Video Data

Uriel Singer, Adam Polyak, Thomas Hayes, Xi Yin, Jie An, Songyang Zhang, Qiyuan Hu, Harry Yang, Oron Ashual, Oran Gafni, et al. Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:2209.14792 ,

work page internal anchor Pith review Pith/arXiv arXiv
[30]

Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2

Ivan Skorokhodov, Sergey Tulyakov, and Mohamed Elho- seiny. Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pat- tern Recognition, pages 3626–3636, 2022. 1, 3

[31]

Deep unsupervised learning using nonequilibrium thermodynamics

Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In International Conference on Machine Learning, pages 2256–2265. PMLR, 2015. 3

[32]

Denoising Diffusion Implicit Models

Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502, 2020. 3

[33]

Score-Based Generative Modeling through Stochastic Differential Equations

Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456, 2020. 3

[34]

UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild

Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. Ucf101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402, 2012. 3, 6

[35]

A good image generator is what you need for high-resolution video synthesis

Yu Tian, Jian Ren, Menglei Chai, Kyle Olszewski, Xi Peng, Dimitris N. Metaxas, and Sergey Tulyakov. A good image generator is what you need for high-resolution video synthesis. In International Conference on Learning Representations, 2021. 1, 3, 6, 7

[36]

Mocogan: Decomposing motion and content for video generation

Sergey Tulyakov, Ming-Yu Liu, Xiaodong Yang, and Jan Kautz. Mocogan: Decomposing motion and content for video generation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1526–1535,

[37]

Towards Accurate Generative Models of Video: A New Metric & Challenges

Thomas Unterthiner, Sjoerd van Steenkiste, Karol Kurach, Raphael Marinier, Marcin Michalski, and Sylvain Gelly. Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:1812.01717, 2018. 6

[38]

Towards accurate generative models of video: A new metric & challenges

Thomas Unterthiner, Sjoerd van Steenkiste, Karol Kurach, Raphael Marinier, Marcin Michalski, and Sylvain Gelly. Towards accurate generative models of video: A new metric & challenges. ICLR, 2019. 6

[39]

Masked conditional video diffusion for prediction, generation, and interpolation

Vikram Voleti, Alexia Jolicoeur-Martineau, and Christopher Pal. Masked conditional video diffusion for prediction, generation, and interpolation. arXiv preprint arXiv:2205.09853,

[40]

Generating videos with scene dynamics

Carl Vondrick, Hamed Pirsiavash, and Antonio Torralba. Generating videos with scene dynamics. Advances in neural information processing systems, 29, 2016. 1, 3

[41]

Predicting video with vqvae

Jacob Walker, Ali Razavi, and Aäron van den Oord. Predicting video with vqvae. arXiv preprint arXiv:2103.01950,

[42]

Scaling autoregressive video models

Dirk Weissenborn, Oscar Täckström, and Jakob Uszkoreit. Scaling autoregressive video models. arXiv preprint arXiv:1906.02634, 2019. 1

[43]

Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks

Wei Xiong, Wenhan Luo, Lin Ma, Wei Liu, and Jiebo Luo. Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018. 3, 6

[44]

VideoGPT: Video Generation using VQ-VAE and Transformers

Wilson Yan, Yunzhi Zhang, Pieter Abbeel, and Aravind Srinivas. Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:2104.10157, 2021. 1, 3

[45]

Video probabilistic diffusion models in projected latent space

Sihyun Yu, Kihyuk Sohn, Subin Kim, and Jinwoo Shin. Video probabilistic diffusion models in projected latent space. arXiv preprint arXiv:2302.07685, 2023. 4

[46]

Generating videos with dynamics-aware implicit generative adversarial networks

Sihyun Yu, Jihoon Tack, Sangwoo Mo, Hyunsu Kim, Junho Kim, Jung-Woo Ha, and Jinwoo Shin. Generating videos with dynamics-aware implicit generative adversarial networks. In International Conference on Learning Representations, 2022. 3, 6, 7

[47]

The unreasonable effectiveness of deep features as a perceptual metric

Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In CVPR, 2018. 5

[48]

Magicvideo: Efficient video generation with latent diffusion models

Daquan Zhou, Weimin Wang, Hanshu Yan, Weiwei Lv, Yizhe Zhu, and Jiashi Feng. Magicvideo: Efficient video generation with latent diffusion models. arXiv preprint arXiv:2211.11018, 2022. 1, 4
