I2VGen-XL: High-Quality Image-to-Video Synthesis via Cascaded Diffusion Models
Pith reviewed 2026-05-17 23:47 UTC · model grok-4.3
The pith
A cascaded diffusion model guided by static images generates videos that keep semantic accuracy, detail continuity, and clarity.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that a cascaded approach, I2VGen-XL, decouples semantic accuracy from qualitative factors. With static images as guidance and two hierarchical encoders in the base stage, it is said to produce videos that combine coherent semantics, content preserved from the input image, better detail continuity, and improved clarity at 1280×720 resolution after the refinement stage.
What carries the argument
The two-stage cascaded diffusion model: the base stage uses two hierarchical encoders to guarantee coherent semantics and preserve content from the input image, while the refinement stage incorporates brief text to enhance details and raise resolution.
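The data flow of the two stages can be sketched as two composed functions. This is a shape-only illustration, not the authors' implementation: the names `Video`, `base_stage`, `refinement_stage`, and `cascade`, the encoder labels, the 16-frame length, and the 448x256 base resolution are hypothetical placeholders. Only the stage order, the two image encoders, the brief text in refinement, and the 1280x720 output come from the abstract.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Video:
    """Shape-only stand-in for a generated clip (no pixels are stored)."""
    frames: int
    width: int
    height: int
    conditions: Tuple[str, ...]  # record of what the clip was conditioned on


def base_stage(image_id: str, frames: int = 16,
               width: int = 448, height: int = 256) -> Video:
    """Stage 1: low-resolution clip conditioned on the input image.

    The two encoder labels are placeholders for the paper's
    "two hierarchical encoders"; frames/width/height are assumed values.
    """
    conditions = (f"encoder_a({image_id})", f"encoder_b({image_id})")
    return Video(frames, width, height, conditions)


def refinement_stage(video: Video, brief_text: str,
                     width: int = 1280, height: int = 720) -> Video:
    """Stage 2: same frame count, higher resolution, plus a brief text."""
    conditions = video.conditions + (f"text({brief_text})",)
    return Video(video.frames, width, height, conditions)


def cascade(image_id: str, brief_text: str) -> Video:
    """Run the base stage, then feed its output to the refinement stage."""
    return refinement_stage(base_stage(image_id), brief_text)


if __name__ == "__main__":
    print(cascade("input.png", "a boat drifting on a lake"))
```

The point of the split is visible in the signatures: the image conditions enter only in the base stage, and the text enters only in refinement, so each stage can be examined separately.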
If this is right
- Videos maintain tighter alignment between input image content and generated frames across the sequence.
- Spatio-temporal continuity improves, reducing jerky motion and detail flicker.
- Higher-resolution output at 1280x720 comes from the refinement stage of the same cascade rather than from a separate post-hoc upsampling step.
- Training on around 35 million single-shot text-video pairs and 6 billion text-image pairs increases output diversity while keeping semantic fidelity.
Where Pith is reading between the lines
- The same base-plus-refinement split could extend to other conditioned generation tasks such as image-to-3D or text-to-audio.
- Further scaling of aligned image-video datasets would likely relax the need for perfect motion alignment during collection.
- If the refinement stage can run independently, the method may support interactive editing of generated clips.
Load-bearing premise
Static images used as guidance plus the two hierarchical encoders will reliably preserve content and semantics without introducing new artifacts or drift, even when the input image and target motion are not perfectly aligned in the collected data.
What would settle it
Generate videos from input images containing complex or mismatched motions and check whether the outputs exhibit semantic drift, loss of fine details, or visible artifacts relative to the source image.
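One way to make this check concrete is to compare an embedding of the source image with an embedding of each generated frame and look for the first frame whose similarity falls below a cutoff. The sketch below assumes the embeddings already exist as plain float vectors (for instance from an image encoder of your choice); the function names and the threshold are hypothetical and are not taken from the paper.

```python
import math
from typing import List, Optional, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0.0:
        raise ValueError("zero-length embedding")
    return dot / norm


def drift_profile(source: Sequence[float],
                  frames: Sequence[Sequence[float]]) -> List[float]:
    """Similarity of every frame embedding to the source-image embedding."""
    return [cosine(source, frame) for frame in frames]


def first_drift_frame(profile: Sequence[float],
                      threshold: float) -> Optional[int]:
    """Index of the first frame below the threshold, or None if none is."""
    for index, similarity in enumerate(profile):
        if similarity < threshold:
            return index
    return None
```

A falling profile across the clip would point to semantic drift; a flat profile with visible artifacts would point to a loss of detail that an embedding comparison does not capture, so this check complements visual inspection rather than replacing it.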
Original abstract
Video synthesis has recently made remarkable strides benefiting from the rapid development of diffusion models. However, it still encounters challenges in terms of semantic accuracy, clarity and spatio-temporal continuity. They primarily arise from the scarcity of well-aligned text-video data and the complex inherent structure of videos, making it difficult for the model to simultaneously ensure semantic and qualitative excellence. In this report, we propose a cascaded I2VGen-XL approach that enhances model performance by decoupling these two factors and ensures the alignment of the input data by utilizing static images as a form of crucial guidance. I2VGen-XL consists of two stages: i) the base stage guarantees coherent semantics and preserves content from input images by using two hierarchical encoders, and ii) the refinement stage enhances the video's details by incorporating an additional brief text and improves the resolution to 1280×720. To improve the diversity, we collect around 35 million single-shot text-video pairs and 6 billion text-image pairs to optimize the model. By this means, I2VGen-XL can simultaneously enhance the semantic accuracy, continuity of details and clarity of generated videos. Through extensive experiments, we have investigated the underlying principles of I2VGen-XL and compared it with current top methods, which can demonstrate its effectiveness on diverse data. The source code and models will be publicly available at https://i2vgen-xl.github.io.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes I2VGen-XL, a cascaded diffusion model for image-to-video synthesis. It decouples semantic accuracy from visual quality via two stages: a base stage that employs two hierarchical encoders together with static-image guidance to preserve content and ensure coherent semantics, and a refinement stage that adds brief text conditioning to boost detail and resolution to 1280×720. Training relies on a newly collected corpus of ~35 million single-shot text-video pairs plus 6 billion text-image pairs; the central claim is that this architecture simultaneously improves semantic accuracy, spatio-temporal continuity, and clarity relative to prior methods.
Significance. If the quantitative claims hold, the cascaded design and large-scale single-shot data collection would represent a practical advance in controllable video synthesis, particularly for applications requiring faithful image-to-video translation. Public release of code and models would strengthen reproducibility and enable direct comparisons.
major comments (2)
- [Abstract and §3] Abstract and §3 (base-stage description): the claim that the two hierarchical encoders plus static-image guidance 'guarantee coherent semantics and preserve content' is load-bearing for the central contribution, yet the text provides no explicit alignment loss, content-consistency regularizer, or misalignment-robust training procedure. If the encoders rely only on standard diffusion conditioning, any mismatch between the guidance image and the motion statistics in the collected clips can produce semantic drift that the cascaded pipeline does not automatically correct.
- [§4] §4 (experiments): the abstract asserts performance gains in semantic accuracy, continuity, and clarity, but the provided text supplies no quantitative metrics (FVD, FID, CLIP similarity, user-study scores), ablation tables, or error bars. Without these numbers it is impossible to verify that the hierarchical encoders and refinement stage deliver the claimed simultaneous improvements rather than trading one quality for another.
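For context on two of the metrics named above: FID and FVD are both Fréchet distances between Gaussians fitted to feature embeddings of real and generated samples (image features for FID, video features for FVD). The general definition is given below; the symbols are introduced here for illustration and do not appear in the reviewed text.

```latex
d^2\big((\mu_r, \Sigma_r), (\mu_g, \Sigma_g)\big)
  = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\big(\Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2}\big)
```

Here $\mu_r, \Sigma_r$ are the mean and covariance of the real-sample features and $\mu_g, \Sigma_g$ those of the generated samples. Lower is better, and the value depends on the feature extractor and sample count, which is why reported numbers need those details to be comparable.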
minor comments (2)
- [Dataset collection paragraph] Clarify the exact definition of 'single-shot' for the 35 M text-video pairs and whether any filtering was applied to ensure motion-image alignment.
- [Figures 2-3] Figure captions and architecture diagrams should explicitly label the two hierarchical encoders and the conditioning pathways from the static image.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We have revised the manuscript to address the concerns regarding the base-stage mechanisms and the presentation of experimental results. Point-by-point responses follow.
Referee: [Abstract and §3] Abstract and §3 (base-stage description): the claim that the two hierarchical encoders plus static-image guidance 'guarantee coherent semantics and preserve content' is load-bearing for the central contribution, yet the text provides no explicit alignment loss, content-consistency regularizer, or misalignment-robust training procedure. If the encoders rely only on standard diffusion conditioning, any mismatch between the guidance image and the motion statistics in the collected clips can produce semantic drift that the cascaded pipeline does not automatically correct.
Authors: We appreciate this observation. The base stage conditions the diffusion model on multi-scale features extracted by the two hierarchical encoders from the input image, combined with direct static-image guidance to anchor content. This conditioning is applied throughout the denoising process rather than relying solely on standard text conditioning. While no auxiliary alignment loss is introduced beyond the diffusion objective, the architecture and large-scale training data are intended to promote semantic consistency. We acknowledge that mismatches in motion statistics could still lead to drift in edge cases. In the revised manuscript we have expanded §3 with a clearer description of the conditioning pathway and added a limitations paragraph discussing potential semantic drift. revision: yes
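The point under discussion can be stated with a generic conditional denoising objective. The form below is the common noise-prediction loss for latent diffusion and is an assumption here: the reviewed text does not state which parameterization the paper uses, and the symbols are introduced only for illustration.

```latex
\mathcal{L}(\theta)
  = \mathbb{E}_{z_0,\; c,\; \epsilon \sim \mathcal{N}(0, I),\; t}
    \left[ \lVert \epsilon - \epsilon_\theta(z_t, c, t) \rVert_2^2 \right]
```

Here $z_t$ is the noised video latent at step $t$ and $c$ collects the conditioning signals (the image features and any text). Because $c$ is an input to $\epsilon_\theta$ at every $t$, the conditioning acts throughout denoising, as the authors say. The referee's concern also remains visible: no term in this loss penalizes a mismatch between the generated content and the guidance image directly.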
Referee: [§4] §4 (experiments): the abstract asserts performance gains in semantic accuracy, continuity, and clarity, but the provided text supplies no quantitative metrics (FVD, FID, CLIP similarity, user-study scores), ablation tables, or error bars. Without these numbers it is impossible to verify that the hierarchical encoders and refinement stage deliver the claimed simultaneous improvements rather than trading one quality for another.
Authors: We agree that explicit quantitative evidence is necessary to substantiate the claims. The original submission contained experimental results and ablations, but these were not presented with sufficient prominence or numerical detail. In the revised version we have reorganized §4 to include tables reporting FVD, FID, CLIP similarity scores, and user-study results with error bars, together with ablation studies isolating the contribution of the hierarchical encoders and the refinement stage. These additions allow direct verification that the cascaded design improves all three aspects simultaneously. revision: yes
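As an illustration of what "with error bars" could mean for a table row, the sketch below reports a mean and its standard error over per-sample scores. It is a minimal example under the assumption that each method yields a list of per-video scores; the function names are hypothetical, and the rebuttal does not say which error measure the revised tables use.

```python
import math
import statistics
from typing import Sequence, Tuple


def mean_with_sem(scores: Sequence[float]) -> Tuple[float, float]:
    """Return (mean, standard error of the mean) for per-sample scores."""
    if len(scores) < 2:
        raise ValueError("need at least two scores for an error bar")
    mean = statistics.fmean(scores)
    # Sample standard deviation divided by sqrt(n) gives the standard error.
    sem = statistics.stdev(scores) / math.sqrt(len(scores))
    return mean, sem


def format_row(name: str, scores: Sequence[float]) -> str:
    """Format one table row as 'name: mean +/- sem'."""
    mean, sem = mean_with_sem(scores)
    return f"{name}: {mean:.3f} +/- {sem:.3f}"
```

Applying `format_row` to the full model and to each ablated variant on the same evaluation set would show whether a gap between two rows exceeds the spread within each row.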
Circularity Check
No circularity: empirical cascaded architecture with data collection is self-contained
The paper describes an empirical proposal for a two-stage cascaded diffusion model (base stage using two hierarchical encoders plus static image guidance for semantics and content preservation; refinement stage for detail and resolution) trained on newly collected 35M text-video pairs and 6B text-image pairs. No equations, first-principles derivations, or predictions are presented that reduce by construction to fitted parameters, self-definitions, or self-citation chains. Claims of simultaneous improvement in semantic accuracy, continuity, and clarity rest on architecture design choices and experimental comparisons rather than any load-bearing loop back to the inputs themselves. This is a standard self-contained empirical contribution.
Axiom & Free-Parameter Ledger
free parameters (1)
- diffusion model hyperparameters and training schedule
axioms (1)
- domain assumption: Diffusion models conditioned on images and text can produce temporally coherent video when trained on large aligned datasets
Forward citations
Cited by 17 Pith papers
- VAnim: Rendering-Aware Sparse State Modeling for Structure-Preserving Vector Animation
  VAnim creates open-domain text-to-SVG animations via sparse state updates on a persistent DOM tree, identification-first planning, and rendering-aware RL with a new 134k-example benchmark.
- Immune2V: Image Immunization Against Dual-Stream Image-to-Video Generation
  Immune2V immunizes images against dual-stream I2V generation by enforcing temporally balanced latent divergence and aligning generative features to a precomputed collapse trajectory, yielding stronger persistent degra...
- VACE: All-in-One Video Creation and Editing
  VACE unifies reference-to-video generation, video-to-video editing, and masked video-to-video editing in one Diffusion Transformer framework using a Video Condition Unit for inputs and a Context Adapter for task injection.
- Head Forcing: Long Autoregressive Video Generation via Head Heterogeneity
  Head Forcing assigns tailored KV cache strategies to local, anchor, and memory attention heads plus head-wise RoPE re-encoding to extend autoregressive video generation from seconds to minutes without training.
- SwiftI2V: Efficient High-Resolution Image-to-Video Generation via Conditional Segment-wise Generation
  SwiftI2V achieves comparable 2K I2V quality to end-to-end models on VBench-I2V while cutting GPU time by 202x through low-resolution motion planning followed by strongly image-conditioned segment-wise high-resolution ...
- SwiftI2V: Efficient High-Resolution Image-to-Video Generation via Conditional Segment-wise Generation
  SwiftI2V matches end-to-end 2K I2V quality on VBench while cutting GPU time by 202x via conditional segment-wise generation that bounds token cost and preserves input fidelity.
- PhysLayer: Language-Guided Layered Animation with Depth-Aware Physics
  PhysLayer is a framework that decomposes images into depth layers, simulates physics with depth awareness, and synthesizes videos guided by language for more plausible animations.
- Ego-InBetween: Generating Object State Transitions in Ego-Centric Videos
  EgoIn uses a fine-tuned vision-language model to infer transition steps and a conditioning module plus auxiliary supervision to generate coherent egocentric video sequences of object state changes.
- Rolling Sink: Bridging Limited-Horizon Training and Open-Ended Testing in Autoregressive Video Diffusion
  Rolling Sink is a training-free cache adjustment technique that maintains visual consistency in autoregressive video diffusion models for ultra-long open-ended generation beyond training horizons.
- Video Generators are Robot Policies
  Training models to generate videos of robot actions produces policies that generalize better to new objects and tasks while using far less demonstration data than standard behavior cloning.
- LTX-Video: Realtime Video Latent Diffusion
  LTX-Video integrates Video-VAE and transformer for 1:192 latent compression and real-time video diffusion by moving patchifying to the VAE and letting the decoder finish denoising in pixel space.
- CameraCtrl: Enabling Camera Control for Text-to-Video Generation
  CameraCtrl enables accurate camera pose control in video diffusion models through a trained plug-and-play module and dataset choices emphasizing diverse camera trajectories with matching appearance.
- Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets
  Stable Video Diffusion scales latent video diffusion models via text-to-image pretraining, video pretraining on curated data, and high-quality finetuning to produce competitive text-to-video and image-to-video results...
- Lightning Unified Video Editing via In-Context Sparse Attention
  ISA prunes low-saliency context tokens and routes queries by sharpness to either full or 0-th order Taylor sparse attention, enabling LIVEditor to cut attention latency ~60% while beating prior video editing methods o...
- Ride the Wave: Precision-Allocated Sparse Attention for Smooth Video Generation
  PASA uses curvature-aware dynamic budgeting, grouped approximations, and stochastic attention routing to accelerate video diffusion transformers while eliminating temporal flickering from sparse patterns.
- Movie Gen: A Cast of Media Foundation Models
  A 30B-parameter transformer and related models generate high-quality videos and audio, claiming state-of-the-art results on text-to-video, video editing, personalization, and audio generation tasks.
- Show-o2: Improved Native Unified Multimodal Models
  Show-o2 unifies text, image, and video understanding and generation in a single autoregressive-plus-flow-matching model built on 3D causal VAE representations.