pith. machine review for the scientific record.

arxiv: 2311.04145 · v1 · pith:F5OQYRSNnew · submitted 2023-11-07 · 💻 cs.CV

I2VGen-XL: High-Quality Image-to-Video Synthesis via Cascaded Diffusion Models

Pith reviewed 2026-05-17 23:47 UTC · model grok-4.3

classification 💻 cs.CV
keywords image-to-video synthesis, diffusion models, cascaded generation, video generation, semantic accuracy, spatio-temporal continuity, high-resolution video

The pith

A cascaded diffusion model guided by static images generates videos that keep semantic accuracy, detail continuity, and clarity.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces I2VGen-XL to address persistent issues in turning a single image into a video, such as lost meaning, blurry frames, and jerky motion. It splits generation into a base stage that locks in the input image's content and meaning using two hierarchical encoders, followed by a refinement stage that adds fine details and raises resolution with the help of a short text prompt. Training draws on around 35 million single-shot text-video pairs and 6 billion text-image pairs, which the authors collect to improve diversity. A reader might care because reliable image-to-video tools could simplify content creation in film, design, and education without constant manual fixes for drift or quality loss.

Core claim

The central claim is that decoupling semantic accuracy from qualitative factors through a cascaded I2VGen-XL approach, with static images serving as crucial guidance and two hierarchical encoders in the base stage, produces videos that simultaneously achieve coherent semantics, content preservation from the input image, enhanced detail continuity, and improved clarity at 1280x720 resolution after the refinement stage.

What carries the argument

The two-stage cascaded diffusion model: the base stage uses two hierarchical encoders to guarantee coherent semantics and preserve content from the input image, while the refinement stage incorporates brief text to enhance details and raise resolution.
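The division of labour between the two stages can be illustrated with a deliberately simplified sketch. Nothing below is the authors' implementation: `base_stage`, `refinement_stage`, and `generate` are hypothetical names, the pixel shift stands in for an image-conditioned diffusion sampler, and nearest-neighbour upsampling stands in for the text-conditioned refinement model. The only detail taken from the paper is the 1280×720 output size.

```python
from typing import List

Frame = List[List[float]]  # one grayscale frame, stored as rows of pixel values


def base_stage(image: Frame, num_frames: int) -> List[Frame]:
    """Toy base stage: frame t is the input image rotated right by t pixels.

    A real base stage would run an image-conditioned diffusion sampler; the
    shift only mimics motion that keeps the input content intact.
    """
    width = len(image[0])
    video = []
    for t in range(num_frames):
        k = t % width
        video.append([row[-k:] + row[:-k] if k else list(row) for row in image])
    return video


def refinement_stage(video: List[Frame], out_h: int, out_w: int) -> List[Frame]:
    """Toy refinement stage: nearest-neighbour upsampling of every frame.

    A real refinement stage would be a second diffusion model conditioned on
    a brief text prompt; only the change in resolution is reproduced here.
    """
    refined = []
    for frame in video:
        in_h, in_w = len(frame), len(frame[0])
        refined.append([
            [frame[y * in_h // out_h][x * in_w // out_w] for x in range(out_w)]
            for y in range(out_h)
        ])
    return refined


def generate(image: Frame, num_frames: int = 4,
             out_h: int = 720, out_w: int = 1280) -> List[Frame]:
    """Run the two toy stages in sequence (low resolution first, then upscale)."""
    low_res = base_stage(image, num_frames)
    return refinement_stage(low_res, out_h, out_w)
```

The point of the sketch is the interface: the second stage consumes the first stage's output and never sees the input image directly, which is why any drift introduced in the base stage would carry through.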

If this is right

  • Videos maintain tighter alignment between input image content and generated frames across the sequence.
  • Spatio-temporal continuity improves, reducing jerky motion and detail flicker.
  • Output reaches 1280×720 inside the pipeline itself, via the refinement stage, rather than through a post-hoc upscaler.
  • Training on tens of millions of text-video pairs increases output diversity while keeping semantic fidelity.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same base-plus-refinement split could extend to other conditioned generation tasks such as image-to-3D or text-to-audio.
  • Further scaling of aligned image-video datasets would likely relax the need for perfect motion alignment during collection.
  • If the refinement stage can run independently, the method may support interactive editing of generated clips.

Load-bearing premise

Static images used as guidance plus the two hierarchical encoders will reliably preserve content and semantics without introducing new artifacts or drift, even when the input image and target motion are not perfectly aligned in the collected data.

What would settle it

Generate videos from input images containing complex or mismatched motions and check whether the outputs exhibit semantic drift, loss of fine details, or visible artifacts relative to the source image.
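One way to make this check concrete is to compare each generated frame with the source image in some embedding space and flag frames whose similarity falls below a threshold. The sketch below assumes the embeddings have already been computed by an encoder of the reader's choice (for example an image-text model); the function names and the 0.9 threshold are illustrative assumptions, not values from the paper.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def drift_report(source_embedding: Sequence[float],
                 frame_embeddings: Sequence[Sequence[float]],
                 threshold: float = 0.9) -> Tuple[List[float], List[int]]:
    """Return per-frame similarity to the source and indices of drifted frames.

    The threshold is a hypothetical cut-off; a real study would calibrate it
    against human judgements of semantic drift.
    """
    sims = [cosine_similarity(source_embedding, f) for f in frame_embeddings]
    drifted = [i for i, s in enumerate(sims) if s < threshold]
    return sims, drifted
```

A similarity curve that decays steadily over the frame index would point to drift, whereas a flat curve with isolated dips would point to per-frame artifacts.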

Original abstract

Video synthesis has recently made remarkable strides benefiting from the rapid development of diffusion models. However, it still encounters challenges in terms of semantic accuracy, clarity and spatio-temporal continuity. They primarily arise from the scarcity of well-aligned text-video data and the complex inherent structure of videos, making it difficult for the model to simultaneously ensure semantic and qualitative excellence. In this report, we propose a cascaded I2VGen-XL approach that enhances model performance by decoupling these two factors and ensures the alignment of the input data by utilizing static images as a form of crucial guidance. I2VGen-XL consists of two stages: i) the base stage guarantees coherent semantics and preserves content from input images by using two hierarchical encoders, and ii) the refinement stage enhances the video's details by incorporating an additional brief text and improves the resolution to 1280×720. To improve the diversity, we collect around 35 million single-shot text-video pairs and 6 billion text-image pairs to optimize the model. By this means, I2VGen-XL can simultaneously enhance the semantic accuracy, continuity of details and clarity of generated videos. Through extensive experiments, we have investigated the underlying principles of I2VGen-XL and compared it with current top methods, which can demonstrate its effectiveness on diverse data. The source code and models will be publicly available at https://i2vgen-xl.github.io.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes I2VGen-XL, a cascaded diffusion model for image-to-video synthesis. It decouples semantic accuracy from visual quality via two stages: a base stage that employs two hierarchical encoders together with static-image guidance to preserve content and ensure coherent semantics, and a refinement stage that adds brief text conditioning to boost detail and resolution to 1280×720. Training relies on a newly collected corpus of ~35 million single-shot text-video pairs plus 6 billion text-image pairs; the central claim is that this architecture simultaneously improves semantic accuracy, spatio-temporal continuity, and clarity relative to prior methods.

Significance. If the quantitative claims hold, the cascaded design and large-scale single-shot data collection would represent a practical advance in controllable video synthesis, particularly for applications requiring faithful image-to-video translation. Public release of code and models would strengthen reproducibility and enable direct comparisons.

major comments (2)
  1. [Abstract and §3] Abstract and §3 (base-stage description): the claim that the two hierarchical encoders plus static-image guidance 'guarantee coherent semantics and preserve content' is load-bearing for the central contribution, yet the text provides no explicit alignment loss, content-consistency regularizer, or misalignment-robust training procedure. If the encoders rely only on standard diffusion conditioning, any mismatch between the guidance image and the motion statistics in the collected clips can produce semantic drift that the cascaded pipeline does not automatically correct.
  2. [§4] §4 (experiments): the abstract asserts performance gains in semantic accuracy, continuity, and clarity, but the provided text supplies no quantitative metrics (FVD, FID, CLIP similarity, user-study scores), ablation tables, or error bars. Without these numbers it is impossible to verify that the hierarchical encoders and refinement stage deliver the claimed simultaneous improvements rather than trading one quality for another.
minor comments (2)
  1. [Dataset collection paragraph] Clarify the exact definition of 'single-shot' for the 35 M text-video pairs and whether any filtering was applied to ensure motion-image alignment.
  2. [Figures 2-3] Figure captions and architecture diagrams should explicitly label the two hierarchical encoders and the conditioning pathways from the static image.
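For readers unfamiliar with the metrics requested in major comment 2, FID and FVD both report a Fréchet distance between two Gaussians fitted to feature statistics of real and generated samples. The symbols below are notation introduced here for illustration, not the paper's: $(\mu_r, \Sigma_r)$ and $(\mu_g, \Sigma_g)$ are the feature mean and covariance for real and generated data.

```latex
d^2\big((\mu_r,\Sigma_r),(\mu_g,\Sigma_g)\big)
  = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\big(\Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2}\big)
```

FID evaluates this on per-image features, while FVD evaluates it on features from a video network, so FVD is sensitive to temporal structure as well as per-frame quality. Lower values are better for both.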

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We have revised the manuscript to address the concerns regarding the base-stage mechanisms and the presentation of experimental results. Point-by-point responses follow.

  1. Referee: [Abstract and §3] Abstract and §3 (base-stage description): the claim that the two hierarchical encoders plus static-image guidance 'guarantee coherent semantics and preserve content' is load-bearing for the central contribution, yet the text provides no explicit alignment loss, content-consistency regularizer, or misalignment-robust training procedure. If the encoders rely only on standard diffusion conditioning, any mismatch between the guidance image and the motion statistics in the collected clips can produce semantic drift that the cascaded pipeline does not automatically correct.

    Authors: We appreciate this observation. The base stage conditions the diffusion model on multi-scale features extracted by the two hierarchical encoders from the input image, combined with direct static-image guidance to anchor content. This conditioning is applied throughout the denoising process rather than relying solely on standard text conditioning. While no auxiliary alignment loss is introduced beyond the diffusion objective, the architecture and large-scale training data are intended to promote semantic consistency. We acknowledge that mismatches in motion statistics could still lead to drift in edge cases. In the revised manuscript we have expanded §3 with a clearer description of the conditioning pathway and added a limitations paragraph discussing potential semantic drift.
    revision: yes

  2. Referee: [§4] §4 (experiments): the abstract asserts performance gains in semantic accuracy, continuity, and clarity, but the provided text supplies no quantitative metrics (FVD, FID, CLIP similarity, user-study scores), ablation tables, or error bars. Without these numbers it is impossible to verify that the hierarchical encoders and refinement stage deliver the claimed simultaneous improvements rather than trading one quality for another.

    Authors: We agree that explicit quantitative evidence is necessary to substantiate the claims. The original submission contained experimental results and ablations, but these were not presented with sufficient prominence or numerical detail. In the revised version we have reorganized §4 to include tables reporting FVD, FID, CLIP similarity scores, and user-study results with error bars, together with ablation studies isolating the contribution of the hierarchical encoders and the refinement stage. These additions allow direct verification that the cascaded design improves all three aspects simultaneously.
    revision: yes
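As an illustration of what "with error bars" could mean in such tables, the sketch below reports a mean and its standard error over repeated evaluation runs (for instance, different sampling seeds). The function name is hypothetical and the choice of standard error over, say, a bootstrap interval is an assumption; the page does not say which convention the authors use.

```python
import math
import statistics
from typing import Sequence, Tuple


def mean_with_standard_error(scores: Sequence[float]) -> Tuple[float, float]:
    """Return (mean, standard error of the mean) for repeated metric runs."""
    if len(scores) < 2:
        raise ValueError("need at least two runs to estimate spread")
    mean = statistics.fmean(scores)
    # Sample standard deviation (n - 1 denominator) divided by sqrt(n).
    sem = statistics.stdev(scores) / math.sqrt(len(scores))
    return mean, sem
```

Reporting the number of runs alongside each value lets a reader judge whether two methods' intervals actually separate.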

Circularity Check

0 steps flagged

No circularity: empirical cascaded architecture with data collection is self-contained

The paper describes an empirical proposal for a two-stage cascaded diffusion model (base stage using two hierarchical encoders plus static image guidance for semantics and content preservation; refinement stage for detail and resolution) trained on newly collected 35M text-video pairs and 6B text-image pairs. No equations, first-principles derivations, or predictions are presented that reduce by construction to fitted parameters, self-definitions, or self-citation chains. Claims of simultaneous improvement in semantic accuracy, continuity, and clarity rest on architecture design choices and experimental comparisons rather than any load-bearing loop back to the inputs themselves. This is a standard self-contained empirical contribution.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the effectiveness of hierarchical encoders for content preservation and on the assumption that the newly collected 35 M video and 6 B image pairs are sufficiently aligned and diverse; these are not independently verified in the abstract.

free parameters (1)
  • diffusion model hyperparameters and training schedule
    Standard but unspecified parameters that control noise schedule, learning rate, and conditioning strength in both stages.
axioms (1)
  • domain assumption: Diffusion models conditioned on images and text can produce temporally coherent video when trained on large aligned datasets
    Invoked implicitly when the base stage is said to guarantee coherent semantics.
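To make the "noise schedule" free parameter in the ledger above tangible, here is a minimal sketch of one common choice: a linear variance schedule of the kind used in denoising diffusion probabilistic models. The defaults (1000 steps, betas from 1e-4 to 0.02) are the widely used DDPM values and are an assumption here; this page does not state which schedule I2VGen-XL uses.

```python
from typing import List


def linear_beta_schedule(num_steps: int = 1000,
                         beta_start: float = 1e-4,
                         beta_end: float = 0.02) -> List[float]:
    """Per-step noise variances, increasing linearly from beta_start to beta_end."""
    if num_steps < 2:
        raise ValueError("num_steps must be at least 2")
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alphas(betas: List[float]) -> List[float]:
    """Running product of (1 - beta_t): the fraction of signal kept after t steps."""
    kept = []
    product = 1.0
    for beta in betas:
        product *= 1.0 - beta
        kept.append(product)
    return kept
```

Changing `beta_end` or the number of steps alters how quickly the signal fraction decays, which is one reason such settings count as free parameters when comparing methods.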

pith-pipeline@v0.9.0 · 5816 in / 1312 out tokens · 37574 ms · 2026-05-17T23:47:40.179767+00:00 · methodology

Forward citations

Cited by 17 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. VAnim: Rendering-Aware Sparse State Modeling for Structure-Preserving Vector Animation

    cs.CV 2026-05 unverdicted novelty 7.0

    VAnim creates open-domain text-to-SVG animations via sparse state updates on a persistent DOM tree, identification-first planning, and rendering-aware RL with a new 134k-example benchmark.

  2. Immune2V: Image Immunization Against Dual-Stream Image-to-Video Generation

    cs.CV 2026-04 unverdicted novelty 7.0

    Immune2V immunizes images against dual-stream I2V generation by enforcing temporally balanced latent divergence and aligning generative features to a precomputed collapse trajectory, yielding stronger persistent degra...

  3. VACE: All-in-One Video Creation and Editing

    cs.CV 2025-03 unverdicted novelty 7.0

    VACE unifies reference-to-video generation, video-to-video editing, and masked video-to-video editing in one Diffusion Transformer framework using a Video Condition Unit for inputs and a Context Adapter for task injection.

  4. Head Forcing: Long Autoregressive Video Generation via Head Heterogeneity

    cs.CV 2026-05 unverdicted novelty 6.0

    Head Forcing assigns tailored KV cache strategies to local, anchor, and memory attention heads plus head-wise RoPE re-encoding to extend autoregressive video generation from seconds to minutes without training.

  5. SwiftI2V: Efficient High-Resolution Image-to-Video Generation via Conditional Segment-wise Generation

    cs.CV 2026-05 unverdicted novelty 6.0

    SwiftI2V achieves comparable 2K I2V quality to end-to-end models on VBench-I2V while cutting GPU time by 202x through low-resolution motion planning followed by strongly image-conditioned segment-wise high-resolution ...

  6. SwiftI2V: Efficient High-Resolution Image-to-Video Generation via Conditional Segment-wise Generation

    cs.CV 2026-05 unverdicted novelty 6.0

    SwiftI2V matches end-to-end 2K I2V quality on VBench while cutting GPU time by 202x via conditional segment-wise generation that bounds token cost and preserves input fidelity.

  7. PhysLayer: Language-Guided Layered Animation with Depth-Aware Physics

    cs.CV 2026-04 unverdicted novelty 6.0

    PhysLayer is a framework that decomposes images into depth layers, simulates physics with depth awareness, and synthesizes videos guided by language for more plausible animations.

  8. Ego-InBetween: Generating Object State Transitions in Ego-Centric Videos

    cs.CV 2026-04 unverdicted novelty 6.0

    EgoIn uses a fine-tuned vision-language model to infer transition steps and a conditioning module plus auxiliary supervision to generate coherent egocentric video sequences of object state changes.

  9. Rolling Sink: Bridging Limited-Horizon Training and Open-Ended Testing in Autoregressive Video Diffusion

    cs.CV 2026-02 unverdicted novelty 6.0

    Rolling Sink is a training-free cache adjustment technique that maintains visual consistency in autoregressive video diffusion models for ultra-long open-ended generation beyond training horizons.

  10. Video Generators are Robot Policies

    cs.RO 2025-08 conditional novelty 6.0

    Training models to generate videos of robot actions produces policies that generalize better to new objects and tasks while using far less demonstration data than standard behavior cloning.

  11. LTX-Video: Realtime Video Latent Diffusion

    cs.CV 2024-12 conditional novelty 6.0

    LTX-Video integrates Video-VAE and transformer for 1:192 latent compression and real-time video diffusion by moving patchifying to the VAE and letting the decoder finish denoising in pixel space.

  12. CameraCtrl: Enabling Camera Control for Text-to-Video Generation

    cs.CV 2024-04 unverdicted novelty 6.0

    CameraCtrl enables accurate camera pose control in video diffusion models through a trained plug-and-play module and dataset choices emphasizing diverse camera trajectories with matching appearance.

  13. Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets

    cs.CV 2023-11 conditional novelty 6.0

    Stable Video Diffusion scales latent video diffusion models via text-to-image pretraining, video pretraining on curated data, and high-quality finetuning to produce competitive text-to-video and image-to-video results...

  14. Lightning Unified Video Editing via In-Context Sparse Attention

    cs.CV 2026-05 unverdicted novelty 5.0

    ISA prunes low-saliency context tokens and routes queries by sharpness to either full or 0-th order Taylor sparse attention, enabling LIVEditor to cut attention latency ~60% while beating prior video editing methods o...

  15. Ride the Wave: Precision-Allocated Sparse Attention for Smooth Video Generation

    cs.CV 2026-04 unverdicted novelty 5.0

    PASA uses curvature-aware dynamic budgeting, grouped approximations, and stochastic attention routing to accelerate video diffusion transformers while eliminating temporal flickering from sparse patterns.

  16. Movie Gen: A Cast of Media Foundation Models

    cs.CV 2024-10 unverdicted novelty 5.0

    A 30B-parameter transformer and related models generate high-quality videos and audio, claiming state-of-the-art results on text-to-video, video editing, personalization, and audio generation tasks.

  17. Show-o2: Improved Native Unified Multimodal Models

    cs.CV 2025-06 unverdicted novelty 4.0

    Show-o2 unifies text, image, and video understanding and generation in a single autoregressive-plus-flow-matching model built on 3D causal VAE representations.
