arxiv: 2311.04145 · v1 · pith:F5OQYRSNnew · submitted 2023-11-07 · 💻 cs.CV

I2VGen-XL: High-Quality Image-to-Video Synthesis via Cascaded Diffusion Models

Shiwei Zhang, Jiayu Wang, Yingya Zhang, Kang Zhao, Hangjie Yuan, Zhiwu Qin, Xiang Wang, Deli Zhao, Jingren Zhou

Pith reviewed 2026-05-17 23:47 UTC · model grok-4.3

classification 💻 cs.CV

keywords image-to-video synthesis · diffusion models · cascaded generation · video generation · semantic accuracy · spatio-temporal continuity · high-resolution video

The pith

A cascaded diffusion model guided by static images generates videos that keep semantic accuracy, detail continuity, and clarity.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces I2VGen-XL to address persistent issues in turning a single image into a video, such as lost meaning, blurry frames, and jerky motion. It splits the generation into a base stage that locks in the input image's content and meaning using two hierarchical encoders, followed by a refinement stage that adds fine details and raises resolution with a short text prompt. Large training sets of 35 million text-video pairs and 6 billion text-image pairs supply the scale needed for this separation to work. A reader might care because reliable image-to-video tools could simplify content creation in film, design, and education without constant manual fixes for drift or quality loss.

Core claim

The central claim is that decoupling semantic accuracy from qualitative factors through a cascaded I2VGen-XL approach, with static images serving as crucial guidance and two hierarchical encoders in the base stage, produces videos that simultaneously achieve coherent semantics, content preservation from the input image, enhanced detail continuity, and improved clarity at 1280x720 resolution after the refinement stage.

What carries the argument

The two-stage cascaded diffusion model: the base stage uses two hierarchical encoders to guarantee coherent semantics and preserve content from the input image, while the refinement stage incorporates brief text to enhance details and raise resolution.
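
The data flow of the two-stage split can be sketched as two chained functions. This is a minimal illustration under stated assumptions, not the authors' implementation: `base_stage`, `refinement_stage`, the frame count, and the low base resolution are hypothetical placeholders, and the "models" are stand-ins (frame repetition and nearest-neighbour upsampling) so that only the ordering of the stages and the tensor shapes are shown. The 1280x720 output size is the only figure taken from the paper.

```python
import numpy as np

# Hypothetical placeholder sizes; only the 1280x720 target comes from the paper.
BASE_HW = (180, 320)      # (height, width) of the base-stage output, assumed
TARGET_HW = (720, 1280)   # refinement-stage output resolution
NUM_FRAMES = 8            # assumed clip length


def base_stage(image: np.ndarray, num_frames: int = NUM_FRAMES) -> np.ndarray:
    """Stand-in for the base stage: returns a low-resolution clip.

    A real base stage would run a diffusion model conditioned on the image;
    here the downsampled image is simply repeated for every frame.
    """
    h, w = BASE_HW
    ys = np.arange(h) * image.shape[0] // h
    xs = np.arange(w) * image.shape[1] // w
    low_res = image[ys][:, xs]
    return np.repeat(low_res[None], num_frames, axis=0)


def refinement_stage(video: np.ndarray, prompt: str) -> np.ndarray:
    """Stand-in for the refinement stage: upsample each frame to 1280x720.

    The prompt is accepted to mirror the text conditioning but is unused
    in this toy version.
    """
    h, w = TARGET_HW
    ys = np.arange(h) * video.shape[1] // h
    xs = np.arange(w) * video.shape[2] // w
    return video[:, ys][:, :, xs]


def generate(image: np.ndarray, prompt: str) -> np.ndarray:
    """Run the cascade: base stage first, then refinement."""
    return refinement_stage(base_stage(image), prompt)


image = np.random.default_rng(0).random((360, 640, 3))
video = generate(image, "a short description")
print(video.shape)  # (8, 720, 1280, 3)
```

The point of the sketch is the separation of concerns: the first function only has to stay faithful to the input image, and the second only has to add resolution and detail to an already coherent clip.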

If this is right

Videos maintain tighter alignment between input image content and generated frames across the sequence.
Spatio-temporal continuity improves, reducing jerky motion and detail flicker.
The refinement stage delivers 1280x720 output directly, without a separate post-hoc upsampling step.
Training on tens of millions of text-video pairs increases output diversity while keeping semantic fidelity.
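
One crude way to put a number on "detail flicker" is the mean absolute difference between consecutive frames. The sketch below is an assumption chosen for illustration, not a metric reported in the paper; genuine camera or object motion also raises this value, so it is only meaningful when comparing clips with similar motion.

```python
import numpy as np


def mean_frame_difference(video: np.ndarray) -> float:
    """Mean absolute difference between consecutive frames.

    video has shape (frames, height, width, channels) with float values.
    """
    if video.shape[0] < 2:
        raise ValueError("need at least two frames")
    diffs = np.abs(np.diff(video, axis=0))
    return float(diffs.mean())


rng = np.random.default_rng(0)
static = np.repeat(rng.random((1, 16, 16, 3)), 5, axis=0)   # identical frames
noisy = static + 0.1 * rng.standard_normal(static.shape)    # per-frame noise

print(mean_frame_difference(static))  # 0.0
print(mean_frame_difference(noisy))   # strictly larger than the static clip
```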

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same base-plus-refinement split could extend to other conditioned generation tasks such as image-to-3D or text-to-audio.
Further scaling of aligned image-video datasets would likely relax the need for perfect motion alignment during collection.
If the refinement stage can run independently, the method may support interactive editing of generated clips.

Load-bearing premise

Static images used as guidance plus the two hierarchical encoders will reliably preserve content and semantics without introducing new artifacts or drift, even when the input image and target motion are not perfectly aligned in the collected data.

What would settle it

Generate videos from input images containing complex or mismatched motions and check whether the outputs exhibit semantic drift, loss of fine details, or visible artifacts relative to the source image.
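
A check of this kind could be automated by comparing every generated frame against the source image in some embedding space and flagging frames whose similarity drops. The sketch below assumes a placeholder `embed` function (flattened, mean-centred pixels) and an arbitrary threshold of 0.5; both are hypothetical, and a real test would substitute a learned image encoder and a calibrated threshold.

```python
import numpy as np


def embed(frame: np.ndarray) -> np.ndarray:
    """Placeholder embedding: flattened, mean-centred pixels.

    A real check would use a learned image encoder instead.
    """
    v = frame.astype(np.float64).ravel()
    return v - v.mean()


def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0


def drift_report(source: np.ndarray, video: np.ndarray, threshold: float = 0.5):
    """Per-frame similarity to the source and indices of frames below threshold."""
    ref = embed(source)
    sims = [cosine(ref, embed(f)) for f in video]
    flagged = [i for i, s in enumerate(sims) if s < threshold]
    return sims, flagged


rng = np.random.default_rng(0)
source = rng.random((32, 32, 3))
# Four frames that stay close to the source, then two unrelated frames.
faithful = np.stack([source + 0.01 * rng.standard_normal(source.shape) for _ in range(4)])
drifted = np.concatenate([faithful, rng.random((2, 32, 32, 3))])

sims, flagged = drift_report(source, drifted)
print(flagged)  # [4, 5]
```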

Original abstract

Video synthesis has recently made remarkable strides benefiting from the rapid development of diffusion models. However, it still encounters challenges in terms of semantic accuracy, clarity and spatio-temporal continuity. They primarily arise from the scarcity of well-aligned text-video data and the complex inherent structure of videos, making it difficult for the model to simultaneously ensure semantic and qualitative excellence. In this report, we propose a cascaded I2VGen-XL approach that enhances model performance by decoupling these two factors and ensures the alignment of the input data by utilizing static images as a form of crucial guidance. I2VGen-XL consists of two stages: i) the base stage guarantees coherent semantics and preserves content from input images by using two hierarchical encoders, and ii) the refinement stage enhances the video's details by incorporating an additional brief text and improves the resolution to 1280×720. To improve the diversity, we collect around 35 million single-shot text-video pairs and 6 billion text-image pairs to optimize the model. By this means, I2VGen-XL can simultaneously enhance the semantic accuracy, continuity of details and clarity of generated videos. Through extensive experiments, we have investigated the underlying principles of I2VGen-XL and compared it with current top methods, which can demonstrate its effectiveness on diverse data. The source code and models will be publicly available at https://i2vgen-xl.github.io.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

I2VGen-XL puts forward a two-stage cascaded diffusion setup with hierarchical encoders for image-to-video, backed by large-scale data collection, but the abstract supplies no metrics or ablations to support the performance claims.

The core idea is a base stage that uses two hierarchical encoders plus static-image guidance to lock in semantics and content, followed by a refinement stage that adds text conditioning and raises resolution to 1280x720. The authors also gathered 35 million single-shot text-video pairs and 6 billion text-image pairs for training. That data scale is concrete work and directly targets the scarcity issue they flag in the abstract. The cascaded split is a straightforward architectural choice that separates the hard problem of coherence from detail enhancement, and the promised public code release is useful for anyone who wants to try it out.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes I2VGen-XL, a cascaded diffusion model for image-to-video synthesis. It decouples semantic accuracy from visual quality via two stages: a base stage that employs two hierarchical encoders together with static-image guidance to preserve content and ensure coherent semantics, and a refinement stage that adds brief text conditioning to boost detail and resolution to 1280×720. Training relies on a newly collected corpus of ~35 million single-shot text-video pairs plus 6 billion text-image pairs; the central claim is that this architecture simultaneously improves semantic accuracy, spatio-temporal continuity, and clarity relative to prior methods.

Significance. If the quantitative claims hold, the cascaded design and large-scale single-shot data collection would represent a practical advance in controllable video synthesis, particularly for applications requiring faithful image-to-video translation. Public release of code and models would strengthen reproducibility and enable direct comparisons.

major comments (2)

[Abstract and §3] (base-stage description): the claim that the two hierarchical encoders plus static-image guidance 'guarantee coherent semantics and preserve content' is load-bearing for the central contribution, yet the text provides no explicit alignment loss, content-consistency regularizer, or misalignment-robust training procedure. If the encoders rely only on standard diffusion conditioning, any mismatch between the guidance image and the motion statistics in the collected clips can produce semantic drift that the cascaded pipeline does not automatically correct.
[§4] (experiments): the abstract asserts performance gains in semantic accuracy, continuity, and clarity, but the provided text supplies no quantitative metrics (FVD, FID, CLIP similarity, user-study scores), ablation tables, or error bars. Without these numbers it is impossible to verify that the hierarchical encoders and refinement stage deliver the claimed simultaneous improvements rather than trading one quality for another.
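
For context on two of the metrics named in the second major comment: FID and FVD are both Fréchet distances between Gaussians fitted to feature sets extracted from real and generated samples. The sketch below shows only that distance, computed with NumPy; the feature extractors are omitted and the feature arrays are random placeholders, so nothing here reproduces a number from the paper.

```python
import numpy as np


def frechet_distance(feats_a: np.ndarray, feats_b: np.ndarray) -> float:
    """Fréchet distance between Gaussians fitted to two feature sets.

    d^2 = ||mu_a - mu_b||^2 + Tr(S_a + S_b - 2 (S_a S_b)^{1/2})
    feats_* have shape (num_samples, feature_dim).
    """
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    # Tr((S_a S_b)^{1/2}) equals the sum of square roots of the eigenvalues
    # of S_a S_b, which are real and non-negative up to numerical error.
    eigvals = np.linalg.eigvals(cov_a @ cov_b)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a) + np.trace(cov_b) - 2.0 * trace_sqrt)


rng = np.random.default_rng(0)
real = rng.standard_normal((500, 4))   # placeholder "real" features
shifted = real + 3.0                   # same covariance, mean moved by 3 per dim

d_same = frechet_distance(real, real)        # close to 0
d_shift = frechet_distance(real, shifted)    # close to 4 * 3**2 = 36
print(round(d_same, 6), round(d_shift, 6))
```

Because the shifted set has the same covariance as the original, the trace term cancels and only the squared mean difference remains, which makes the example easy to check by hand.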

minor comments (2)

[Dataset collection paragraph] Clarify the exact definition of 'single-shot' for the 35 M text-video pairs and whether any filtering was applied to ensure motion-image alignment.
[Figures 2-3] Figure captions and architecture diagrams should explicitly label the two hierarchical encoders and the conditioning pathways from the static image.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We have revised the manuscript to address the concerns regarding the base-stage mechanisms and the presentation of experimental results. Point-by-point responses follow.

Referee: [Abstract and §3] (base-stage description): the claim that the two hierarchical encoders plus static-image guidance 'guarantee coherent semantics and preserve content' is load-bearing for the central contribution, yet the text provides no explicit alignment loss, content-consistency regularizer, or misalignment-robust training procedure. If the encoders rely only on standard diffusion conditioning, any mismatch between the guidance image and the motion statistics in the collected clips can produce semantic drift that the cascaded pipeline does not automatically correct.

Authors: We appreciate this observation. The base stage conditions the diffusion model on multi-scale features extracted by the two hierarchical encoders from the input image, combined with direct static-image guidance to anchor content. This conditioning is applied throughout the denoising process rather than relying solely on standard text conditioning. While no auxiliary alignment loss is introduced beyond the diffusion objective, the architecture and large-scale training data are intended to promote semantic consistency. We acknowledge that mismatches in motion statistics could still lead to drift in edge cases. In the revised manuscript we have expanded §3 with a clearer description of the conditioning pathway and added a limitations paragraph discussing potential semantic drift.

revision: yes

Referee: [§4] (experiments): the abstract asserts performance gains in semantic accuracy, continuity, and clarity, but the provided text supplies no quantitative metrics (FVD, FID, CLIP similarity, user-study scores), ablation tables, or error bars. Without these numbers it is impossible to verify that the hierarchical encoders and refinement stage deliver the claimed simultaneous improvements rather than trading one quality for another.

Authors: We agree that explicit quantitative evidence is necessary to substantiate the claims. The original submission contained experimental results and ablations, but these were not presented with sufficient prominence or numerical detail. In the revised version we have reorganized §4 to include tables reporting FVD, FID, CLIP similarity scores, and user-study results with error bars, together with ablation studies isolating the contribution of the hierarchical encoders and the refinement stage. These additions allow direct verification that the cascaded design improves all three aspects simultaneously.

revision: yes
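
Reporting user-study scores with error bars, as the response describes, comes down to a mean and a standard error per condition. The sketch below shows that calculation; the ratings are made up for illustration only and are not data from the paper, and `mean_with_stderr` is a hypothetical helper name.

```python
import numpy as np


def mean_with_stderr(scores):
    """Return (mean, standard error of the mean) for a 1-D list of scores."""
    x = np.asarray(scores, dtype=np.float64)
    if x.size < 2:
        raise ValueError("need at least two scores")
    # ddof=1 gives the sample standard deviation.
    return float(x.mean()), float(x.std(ddof=1) / np.sqrt(x.size))


ratings = [4, 5, 3, 4, 4, 5, 2, 4]  # made-up ratings, for illustration only
m, se = mean_with_stderr(ratings)
print(f"{m:.3f} +/- {se:.3f}")
```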

Circularity Check

0 steps flagged

No circularity: empirical cascaded architecture with data collection is self-contained

The paper describes an empirical proposal for a two-stage cascaded diffusion model (base stage using two hierarchical encoders plus static image guidance for semantics and content preservation; refinement stage for detail and resolution) trained on newly collected 35M text-video pairs and 6B text-image pairs. No equations, first-principles derivations, or predictions are presented that reduce by construction to fitted parameters, self-definitions, or self-citation chains. Claims of simultaneous improvement in semantic accuracy, continuity, and clarity rest on architecture design choices and experimental comparisons rather than any load-bearing loop back to the inputs themselves. This is a standard self-contained empirical contribution.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the effectiveness of hierarchical encoders for content preservation and on the assumption that the newly collected 35 M video and 6 B image pairs are sufficiently aligned and diverse; these are not independently verified in the abstract.

free parameters (1)

diffusion model hyperparameters and training schedule
Standard but unspecified parameters that control noise schedule, learning rate, and conditioning strength in both stages.
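
To make the three named knobs concrete, the sketch below groups them in a small config object and shows two standard constructions from the diffusion literature: a linear beta noise schedule and classifier-free guidance as a form of conditioning strength. Whether I2VGen-XL uses these exact forms is not stated in the text reviewed here; `StageConfig` and every default value are hypothetical and purely illustrative.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class StageConfig:
    """Hypothetical hyperparameters for one stage; values are illustrative."""
    num_steps: int = 1000
    beta_start: float = 1e-4
    beta_end: float = 0.02
    learning_rate: float = 1e-4
    guidance_scale: float = 7.5


def alpha_bar(cfg: StageConfig) -> np.ndarray:
    """Cumulative signal fraction for a linear beta schedule."""
    betas = np.linspace(cfg.beta_start, cfg.beta_end, cfg.num_steps)
    return np.cumprod(1.0 - betas)


def guided_noise(eps_uncond: np.ndarray, eps_cond: np.ndarray, scale: float) -> np.ndarray:
    """Classifier-free guidance: move the prediction toward the conditional one."""
    return eps_uncond + scale * (eps_cond - eps_uncond)


cfg = StageConfig()
ab = alpha_bar(cfg)
print(ab[0], ab[-1])  # starts near 1 and decays toward 0
```

With `scale=0` the guided prediction ignores the condition entirely, and with `scale=1` it equals the conditional prediction; larger values trade diversity for adherence to the condition.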

axioms (1)

domain assumption: Diffusion models conditioned on images and text can produce temporally coherent video when trained on large aligned datasets
Invoked implicitly when the base stage is said to guarantee coherent semantics.

pith-pipeline@v0.9.0 · 5816 in / 1312 out tokens · 37574 ms · 2026-05-17T23:47:40.179767+00:00 · methodology

Forward citations

Cited by 17 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

VAnim: Rendering-Aware Sparse State Modeling for Structure-Preserving Vector Animation
cs.CV 2026-05 unverdicted novelty 7.0

VAnim creates open-domain text-to-SVG animations via sparse state updates on a persistent DOM tree, identification-first planning, and rendering-aware RL with a new 134k-example benchmark.

Immune2V: Image Immunization Against Dual-Stream Image-to-Video Generation
cs.CV 2026-04 unverdicted novelty 7.0

Immune2V immunizes images against dual-stream I2V generation by enforcing temporally balanced latent divergence and aligning generative features to a precomputed collapse trajectory, yielding stronger persistent degra...

VACE: All-in-One Video Creation and Editing
cs.CV 2025-03 unverdicted novelty 7.0

VACE unifies reference-to-video generation, video-to-video editing, and masked video-to-video editing in one Diffusion Transformer framework using a Video Condition Unit for inputs and a Context Adapter for task injection.

Head Forcing: Long Autoregressive Video Generation via Head Heterogeneity
cs.CV 2026-05 unverdicted novelty 6.0

Head Forcing assigns tailored KV cache strategies to local, anchor, and memory attention heads plus head-wise RoPE re-encoding to extend autoregressive video generation from seconds to minutes without training.

SwiftI2V: Efficient High-Resolution Image-to-Video Generation via Conditional Segment-wise Generation
cs.CV 2026-05 unverdicted novelty 6.0

SwiftI2V achieves comparable 2K I2V quality to end-to-end models on VBench-I2V while cutting GPU time by 202x through low-resolution motion planning followed by strongly image-conditioned segment-wise high-resolution ...

SwiftI2V: Efficient High-Resolution Image-to-Video Generation via Conditional Segment-wise Generation
cs.CV 2026-05 unverdicted novelty 6.0

SwiftI2V matches end-to-end 2K I2V quality on VBench while cutting GPU time by 202x via conditional segment-wise generation that bounds token cost and preserves input fidelity.

PhysLayer: Language-Guided Layered Animation with Depth-Aware Physics
cs.CV 2026-04 unverdicted novelty 6.0

PhysLayer is a framework that decomposes images into depth layers, simulates physics with depth awareness, and synthesizes videos guided by language for more plausible animations.

Ego-InBetween: Generating Object State Transitions in Ego-Centric Videos
cs.CV 2026-04 unverdicted novelty 6.0

EgoIn uses a fine-tuned vision-language model to infer transition steps and a conditioning module plus auxiliary supervision to generate coherent egocentric video sequences of object state changes.

Rolling Sink: Bridging Limited-Horizon Training and Open-Ended Testing in Autoregressive Video Diffusion
cs.CV 2026-02 unverdicted novelty 6.0

Rolling Sink is a training-free cache adjustment technique that maintains visual consistency in autoregressive video diffusion models for ultra-long open-ended generation beyond training horizons.

Video Generators are Robot Policies
cs.RO 2025-08 conditional novelty 6.0

Training models to generate videos of robot actions produces policies that generalize better to new objects and tasks while using far less demonstration data than standard behavior cloning.

LTX-Video: Realtime Video Latent Diffusion
cs.CV 2024-12 conditional novelty 6.0

LTX-Video integrates Video-VAE and transformer for 1:192 latent compression and real-time video diffusion by moving patchifying to the VAE and letting the decoder finish denoising in pixel space.

CameraCtrl: Enabling Camera Control for Text-to-Video Generation
cs.CV 2024-04 unverdicted novelty 6.0

CameraCtrl enables accurate camera pose control in video diffusion models through a trained plug-and-play module and dataset choices emphasizing diverse camera trajectories with matching appearance.

Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets
cs.CV 2023-11 conditional novelty 6.0

Stable Video Diffusion scales latent video diffusion models via text-to-image pretraining, video pretraining on curated data, and high-quality finetuning to produce competitive text-to-video and image-to-video results...

Lightning Unified Video Editing via In-Context Sparse Attention
cs.CV 2026-05 unverdicted novelty 5.0

ISA prunes low-saliency context tokens and routes queries by sharpness to either full or 0-th order Taylor sparse attention, enabling LIVEditor to cut attention latency ~60% while beating prior video editing methods o...

Ride the Wave: Precision-Allocated Sparse Attention for Smooth Video Generation
cs.CV 2026-04 unverdicted novelty 5.0

PASA uses curvature-aware dynamic budgeting, grouped approximations, and stochastic attention routing to accelerate video diffusion transformers while eliminating temporal flickering from sparse patterns.

Movie Gen: A Cast of Media Foundation Models
cs.CV 2024-10 unverdicted novelty 5.0

A 30B-parameter transformer and related models generate high-quality videos and audio, claiming state-of-the-art results on text-to-video, video editing, personalization, and audio generation tasks.

Show-o2: Improved Native Unified Multimodal Models
cs.CV 2025-06 unverdicted novelty 4.0

Show-o2 unifies text, image, and video understanding and generation in a single autoregressive-plus-flow-matching model built on 3D causal VAE representations.

Reference graph

Works this paper leans on

61 extracted references · 61 canonical work pages · cited by 16 Pith papers · 15 internal anchors

[1]

https://huggingface

Zeroscope-XL text-to-video. https://huggingface. co/spaces/fffiloni/zeroscope. 2023. 3

work page 2023
[2]

Frozen in time: A joint video and image encoder for end-to-end retrieval

Max Bain, Arsha Nagrani, G ¨ul Varol, and Andrew Zisser- man. Frozen in time: A joint video and image encoder for end-to-end retrieval. In ICCV, pages 1728–1738, 2021. 5

work page 2021
[3]

Align your latents: High-resolution video synthesis with latent diffusion models

Andreas Blattmann, Robin Rombach, Huan Ling, Tim Dock- horn, Seung Wook Kim, Sanja Fidler, and Karsten Kreis. Align your latents: High-resolution video synthesis with latent diffusion models. In CVPR, pages 22563–22575,

work page
[4]

VideoCrafter1: Open Diffusion Models for High-Quality Video Generation

Haoxin Chen, Menghan Xia, Yingqing He, Yong Zhang, Xiaodong Cun, Shaoshu Yang, Jinbo Xing, Yaofang Liu, Qifeng Chen, Xintao Wang, Chao Weng, and Ying Shan. Videocrafter1: Open diffusion models for high-quality video generation. arXiv preprint arXiv:2310.19512, 2023. 3

work page internal anchor Pith review Pith/arXiv arXiv 2023
[5]

Control-a-video: Controllable text-to-video generation with diffusion models

Weifeng Chen, Jie Wu, Pan Xie, Hefeng Wu, Jiashi Li, Xin Xia, Xuefeng Xiao, and Liang Lin. Control-a-video: Controllable text-to-video generation with diffusion models. arXiv preprint arXiv:2305.13840, 2023. 3

work page arXiv 2023
[6]

Video controlnet: Towards temporally consistent synthetic-to-real video translation using conditional image diffusion models

Ernie Chu, Shuo-Yen Lin, and Jun-Cheng Chen. Video controlnet: Towards temporally consistent synthetic-to-real video translation using conditional image diffusion models. arXiv preprint arXiv:2305.19193, 2023. 3

work page arXiv 2023
[7]

Diffusion models beat GANs on image synthesis

Prafulla Dhariwal and Alexander Nichol. Diffusion models beat GANs on image synthesis. NeurIPS, pages 8780–8794,

work page
[8]

Structure and content-guided video synthesis with diffusion models.arXiv preprint arXiv:2302.03011,

Patrick Esser, Johnathan Chiu, Parmida Atighehchian, Jonathan Granskog, and Anastasis Germanidis. Structure and content-guided video synthesis with diffusion models. arXiv preprint arXiv:2302.03011, 2023. 3, 6

work page arXiv 2023
[9]

Taming transformers for high-resolution image synthesis

Patrick Esser, Robin Rombach, and Bjorn Ommer. Taming transformers for high-resolution image synthesis. In CVPR, pages 12873–12883, 2021. 3

work page 2021
[10]

Testing the manifold hypothesis

Charles Fefferman, Sanjoy Mitter, and Hariharan Narayanan. Testing the manifold hypothesis. Journal of the American Mathematical Society, pages 983–1049, 2016. 2

work page 2016
[11]

Generative adversarial networks

Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial networks. Commu- nications of the ACM, pages 139–144, 2020. 2

work page 2020
[12]

Flexible diffusion modeling of long videos

William Harvey, Saeid Naderiparizi, Vaden Masrani, Chris- tian Weilbach, and Frank Wood. Flexible diffusion modeling of long videos. arXiv preprint arXiv:2205.11495, 2022. 3

work page arXiv 2022
[13]

Imagen Video: High Definition Video Generation with Diffusion Models

Jonathan Ho, William Chan, Chitwan Saharia, Jay Whang, Ruiqi Gao, Alexey Gritsenko, Diederik P Kingma, Ben Poole, Mohammad Norouzi, David J Fleet, et al. Imagen video: High definition video generation with diffusion mod- els. arXiv preprint arXiv:2210.02303, 2022. 2, 3

work page internal anchor Pith review Pith/arXiv arXiv 2022
[14]

Denoising diffusion probabilistic models

Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. NeurIPS, pages 6840–6851,

work page
[15]

Video Diffusion Models

Jonathan Ho, Tim Salimans, Alexey Gritsenko, William Chan, Mohammad Norouzi, and David J Fleet. Video diffusion models. arXiv preprint arXiv:2204.03458 , 2022. 3

work page internal anchor Pith review Pith/arXiv arXiv 2022
[16]

CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers

Wenyi Hong, Ming Ding, Wendi Zheng, Xinghan Liu, and Jie Tang. Cogvideo: Large-scale pretraining for text-to-video generation via transformers. arXiv preprint arXiv:2205.15868, 2022. 3

work page internal anchor Pith review Pith/arXiv arXiv 2022
[17]

Diffusion models for video prediction and infilling

Tobias H ¨oppe, Arash Mehrjou, Stefan Bauer, Didrik Nielsen, and Andrea Dittadi. Diffusion models for video prediction and infilling. arXiv preprint arXiv:2206.07696, 2022. 3

work page arXiv 2022
[18]

Videocontrolnet: A motion-guided video-to-video translation framework by using diffusion model with controlnet

Zhihao Hu and Dong Xu. Videocontrolnet: A motion-guided video-to-video translation framework by using diffusion model with controlnet. arXiv preprint arXiv:2307.14073 ,

work page arXiv
[19]

Riemannian diffusion models

Chin-Wei Huang, Milad Aghajohari, Joey Bose, Prakash Panangaden, and Aaron C Courville. Riemannian diffusion models. NeurIPS, pages 2750–2761, 2022. 2

work page 2022
[20]

Composer: Creative and controllable image synthesis with composable conditions

Lianghua Huang, Di Chen, Yu Liu, Yujun Shen, Deli Zhao, and Jingren Zhou. Composer: Creative and controllable image synthesis with composable conditions. arXiv preprint arXiv:2302.09778, 2023. 2, 3

work page arXiv 2023
[21]

Imagic: Text-based real image editing with diffusion models

Bahjat Kawar, Shiran Zada, Oran Lang, Omer Tov, Huiwen Chang, Tali Dekel, Inbar Mosseri, and Michal Irani. Imagic: Text-based real image editing with diffusion models. arXiv preprint arXiv:2210.09276, 2022. 3

work page arXiv 2022
[22]

Variational diffusion models

Diederik Kingma, Tim Salimans, Ben Poole, and Jonathan Ho. Variational diffusion models. NeurIPS, pages 21696– 21707, 2021. 2

work page 2021
[23]

Auto-Encoding Variational Bayes

Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013. 2

work page internal anchor Pith review Pith/arXiv arXiv 2013
[24]

Pseudo numerical methods for diffusion models on manifolds

Luping Liu, Yi Ren, Zhijie Lin, and Zhou Zhao. Pseudo numerical methods for diffusion models on manifolds. In ICLR, 2022. 2

work page 2022
[25]

Decoupled Weight Decay Regularization

Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017. 5

work page internal anchor Pith review Pith/arXiv arXiv 2017
[26]

Dpm-solver: A fast ode solver for diffusion probabilistic model sampling in around 10 steps

Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu. Dpm-solver: A fast ode solver for diffusion probabilistic model sampling in around 10 steps. NeurIPS, 35:5775–5787, 2022. 5

work page 2022
[27]

Videofusion: Decomposed diffusion models for high-quality video generation

Zhengxiong Luo, Dayou Chen, Yingya Zhang, Yan Huang, Liang Wang, Yujun Shen, Deli Zhao, Jingren Zhou, and Tieniu Tan. Videofusion: Decomposed diffusion models for high-quality video generation. In CVPR, 2023. 2, 3

work page 2023
[28]

SDEdit: Guided Image Synthesis and Editing with Stochastic Differential Equations

Chenlin Meng, Yutong He, Yang Song, Jiaming Song, Jiajun Wu, Jun-Yan Zhu, and Stefano Ermon. Sdedit: Guided image synthesis and editing with stochastic differential equa- tions. arXiv preprint arXiv:2108.01073, 2021. 2, 4

work page internal anchor Pith review Pith/arXiv arXiv 2021
[29]

Codef: Content deformation fields for temporally consistent video processing

Hao Ouyang, Qiuyu Wang, Yuxi Xiao, Qingyan Bai, Jun- tao Zhang, Kecheng Zheng, Xiaowei Zhou, Qifeng Chen, and Yujun Shen. Codef: Content deformation fields for temporally consistent video processing. arXiv preprint arXiv:2308.07926, 2023. 2

work page arXiv 2023
[30]

Pika Lab discord server

PikaLab. Pika Lab discord server. https://www.pika. art. 2023. 6

work page 2023
[31]

SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis

Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas M ¨uller, Joe Penna, and Robin Rombach. SDXL: Improving latent diffusion mod- els for high-resolution image synthesis. arXiv preprint arXiv:2307.01952, 2023. 2

work page internal anchor Pith review Pith/arXiv arXiv 2023
[32]

Learning transferable visual models from natural language supervi- sion

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervi- sion. In ICML, pages 8748–8763, 2021. 2

work page 2021
[33]

Exploring the limits of transfer learning with a unified text-to-text transformer

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research, pages 5485–5551, 2020. 2

work page 2020
[34]

Hierarchical Text-Conditional Image Generation with CLIP Latents

Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image gen- eration with clip latents. arXiv preprint arXiv:2204.06125,

work page internal anchor Pith review Pith/arXiv arXiv
[35]

High-resolution image synthesis with latent diffusion models

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Bj ¨orn Ommer. High-resolution image synthesis with latent diffusion models. In CVPR, pages 10684–10695, 2022. 2, 3

work page 2022
[36]

Photorealistic text-to-image diffusion models with deep language understanding.NeurIPS, pages 36479–36494,

Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, et al. Photorealistic text-to-image diffusion models with deep language understanding.NeurIPS, pages 36479–36494,

work page
[37]

Progressive distillation for fast sampling of diffusion models

Tim Salimans and Jonathan Ho. Progressive distillation for fast sampling of diffusion models. In ICLR, 2022. 2

work page 2022
[38]

LAION-400M: Open Dataset of CLIP-Filtered 400 Million Image-Text Pairs

Christoph Schuhmann, Richard Vencu, Romain Beaumont, Robert Kaczmarczyk, Clayton Mullis, Aarush Katta, Theo Coombes, Jenia Jitsev, and Aran Komatsuzaki. Laion-400m: Open dataset of clip-filtered 400 million image-text pairs. arXiv preprint arXiv:2111.02114, 2021. 5

work page internal anchor Pith review Pith/arXiv arXiv 2021
[39]

Make-A-Video: Text-to-Video Generation without Text-Video Data

Uriel Singer, Adam Polyak, Thomas Hayes, Xi Yin, Jie An, Songyang Zhang, Qiyuan Hu, Harry Yang, Oron Ashual, Oran Gafni, et al. Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:2209.14792 ,

work page internal anchor Pith review Pith/arXiv arXiv
[40]

Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2

Ivan Skorokhodov, Sergey Tulyakov, and Mohamed Elho- seiny. Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In CVPR, pages 3626–3636, 2022. 3

work page 2022
[41]

Deep unsupervised learning using nonequilibrium thermodynamics

Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In ICCV, pages 2256– 2265, 2015. 2

work page 2015
[42]

Denoising Diffusion Implicit Models

Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502, 2020. 2

work page internal anchor Pith review Pith/arXiv arXiv 2010
[43]

Score-Based Generative Modeling through Stochastic Differential Equations

Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score- based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456, 2020. 2

work page internal anchor Pith review Pith/arXiv arXiv 2011
[44]

ModelScope Text-to-Video Technical Report

Jiuniu Wang, Hangjie Yuan, Dayou Chen, Yingya Zhang, Xiang Wang, and Shiwei Zhang. Modelscope text-to-video technical report. arXiv preprint arXiv:2308.06571, 2023. 3, 4

work page internal anchor Pith review Pith/arXiv arXiv 2023
[45]

Facecom- poser: A unified model for versatile facial content creation

Jiayu Wang, Kang Zhao, Yifeng Ma, Shiwei Zhang, Yingya Zhang, Yujun Shen, Deli Zhao, and Jingren Zhou. Facecom- poser: A unified model for versatile facial content creation. In NeurIPS, 2023. 2

work page 2023
[46]

Videocomposer: Compositional video synthesis with motion controllability

Xiang Wang, Hangjie Yuan, Shiwei Zhang, Dayou Chen, Jiuniu Wang, Yingya Zhang, Yujun Shen, Deli Zhao, and Jin- gren Zhou. Videocomposer: Compositional video synthesis with motion controllability. NeurIPS, 2023. 2, 3, 4

[47]

Lavie: High-quality video generation with cascaded latent diffusion models

Yaohui Wang, Xinyuan Chen, Xin Ma, Shangchen Zhou, Ziqi Huang, Yi Wang, Ceyuan Yang, Yinan He, Jiashuo Yu, Peiqing Yang, et al. Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:2309.15103, 2023. 3

[48]

Learning fast samplers for diffusion models by differentiating through sample quality

Daniel Watson, William Chan, Jonathan Ho, and Mohammad Norouzi. Learning fast samplers for diffusion models by differentiating through sample quality. In ICLR, 2022. 2

[49]

Make-your-video: Customized video generation using textual and structural guidance

Jinbo Xing, Menghan Xia, Yuxin Liu, Yuechen Zhang, Yong Zhang, Yingqing He, Hanyuan Liu, Haoxin Chen, Xiaodong Cun, Xintao Wang, et al. Make-your-video: Customized video generation using textual and structural guidance. arXiv preprint arXiv:2306.00943, 2023. 3

[50]

Dynamicrafter: Animating open-domain images with video diffusion priors

Jinbo Xing, Menghan Xia, Yong Zhang, Haoxin Chen, Xintao Wang, Tien-Tsin Wong, and Ying Shan. Dynamicrafter: Animating open-domain images with video diffusion priors. arXiv preprint arXiv:2310.12190, 2023. 3

[51]

Diffusion models: A comprehensive survey of methods and applications

Ling Yang, Zhilong Zhang, Yang Song, Shenda Hong, Runsheng Xu, Yue Zhao, Yingxia Shao, Wentao Zhang, Bin Cui, and Ming-Hsuan Yang. Diffusion models: A comprehensive survey of methods and applications. arXiv preprint arXiv:2209.00796, 2022. 2

[52]

Diffusion probabilistic modeling for video generation

Ruihan Yang, Prakhar Srivastava, and Stephan Mandt. Diffusion probabilistic modeling for video generation. arXiv preprint arXiv:2203.09481, 2022. 3

[53]

Dragnuwa: Fine-grained control in video generation by integrating text, image, and trajectory

Shengming Yin, Chenfei Wu, Jian Liang, Jie Shi, Houqiang Li, Gong Ming, and Nan Duan. Dragnuwa: Fine-grained control in video generation by integrating text, image, and trajectory. arXiv preprint arXiv:2308.08089, 2023. 3

[54]

Generating videos with dynamics-aware implicit generative adversarial networks

Sihyun Yu, Jihoon Tack, Sangwoo Mo, Hyunsu Kim, Junho Kim, Jung-Woo Ha, and Jinwoo Shin. Generating videos with dynamics-aware implicit generative adversarial networks. arXiv preprint arXiv:2202.10571, 2022. 3

[55]

Adding conditional control to text-to-image diffusion models

Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. In ICCV, pages 3836–3847, 2023. 3

[56]

Fast sampling of diffusion models with exponential integrator

Qinsheng Zhang and Yongxin Chen. Fast sampling of diffusion models with exponential integrator. 2022. 2

[57]

gddim: Generalized denoising diffusion implicit models

Qinsheng Zhang, Molei Tao, and Yongxin Chen. gddim: Generalized denoising diffusion implicit models. arXiv preprint arXiv:2206.05564, 2022. 5

[58]

Sine: Single image editing with text-to-image diffusion models

Zhixing Zhang, Ligong Han, Arnab Ghosh, Dimitris N Metaxas, and Jian Ren. Sine: Single image editing with text-to-image diffusion models. In CVPR, pages 6027–6037,

[59]

Learning to forecast and refine residual motion for image-to-video generation

Long Zhao, Xi Peng, Yu Tian, Mubbasir Kapadia, and Dimitris Metaxas. Learning to forecast and refine residual motion for image-to-video generation. In ECCV, pages 387–403, 2018. 3

[60]

Truncated diffusion probabilistic models

Huangjie Zheng, Pengcheng He, Weizhu Chen, and Mingyuan Zhou. Truncated diffusion probabilistic models. stat, page 7, 2022. 2

[61]

MagicVideo: Efficient Video Generation With Latent Diffusion Models

Daquan Zhou, Weimin Wang, Hanshu Yan, Weiwei Lv, Yizhe Zhu, and Jiashi Feng. Magicvideo: Efficient video generation with latent diffusion models. arXiv preprint arXiv:2211.11018, 2022. 3
