Recognition: 2 theorem links
· Lean TheoremVideoCrafter1: Open Diffusion Models for High-Quality Video Generation
Pith reviewed 2026-05-14 21:36 UTC · model grok-4.3
The pith
Open diffusion models generate realistic videos at 1024x576 resolution from text, with an image-to-video version that preserves input content.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors propose text-to-video and image-to-video diffusion models. The T2V model synthesizes realistic and cinematic-quality videos at a resolution of 1024 × 576, outperforming other open-source T2V models. The I2V model is the first open-source I2V foundation model that transforms a given image into a video clip while maintaining strict content preservation constraints on the reference image's content, structure, and style.
What carries the argument
Text-to-video (T2V) and image-to-video (I2V) diffusion models that use conditioning on text inputs for synthesis and on image inputs for content preservation.
Load-bearing premise
The models achieve the claimed levels of realism, cinematic quality, outperformance, and strict content preservation in generated videos.
What would settle it
An independent side-by-side evaluation or user study where the outputs do not match or exceed the quality of other open-source models or where I2V videos visibly alter the input image's structure or style.
read the original abstract
Video generation has increasingly gained interest in both academia and industry. Although commercial tools can generate plausible videos, there is a limited number of open-source models available for researchers and engineers. In this work, we introduce two diffusion models for high-quality video generation, namely text-to-video (T2V) and image-to-video (I2V) models. T2V models synthesize a video based on a given text input, while I2V models incorporate an additional image input. Our proposed T2V model can generate realistic and cinematic-quality videos with a resolution of $1024 \times 576$, outperforming other open-source T2V models in terms of quality. The I2V model is designed to produce videos that strictly adhere to the content of the provided reference image, preserving its content, structure, and style. This model is the first open-source I2V foundation model capable of transforming a given image into a video clip while maintaining content preservation constraints. We believe that these open-source video generation models will contribute significantly to the technological advancements within the community.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces VideoCrafter1, consisting of a text-to-video (T2V) diffusion model that generates realistic 1024×576 videos from text prompts and claims to outperform prior open-source T2V models, together with an image-to-video (I2V) diffusion model that converts a reference image into a video clip while strictly preserving content, structure, and style; the I2V component is presented as the first open-source foundation model satisfying these preservation constraints.
Significance. If the performance and preservation claims are backed by rigorous quantitative evaluation, the work would supply accessible high-resolution open-source video generation models, enabling broader research in video synthesis and related applications.
major comments (2)
- [Abstract] Abstract: the claim that the T2V model 'outperforms other open-source T2V models in terms of quality' lacks any supporting numerical results, named baselines, or evaluation protocol (e.g., FVD, CLIP-T scores on a shared test set); §4 must supply these comparisons for the central outperformance assertion to be verifiable.
- [Abstract] Abstract: the assertion that the I2V model is 'the first open-source I2V foundation model' capable of 'strictly' preserving content requires explicit comparison to prior open-source I2V methods and quantitative preservation metrics (e.g., per-frame LPIPS or temporal CLIP similarity to the reference image); without these, the novelty and constraint-satisfaction claims cannot be assessed.
minor comments (1)
- [Abstract] Ensure consistent use of math mode for resolution notation (1024 × 576) across all sections and figures.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each major comment point-by-point below, providing clarifications and committing to revisions that strengthen the verifiability of our claims without altering the core contributions.
read point-by-point responses
-
Referee: [Abstract] Abstract: the claim that the T2V model 'outperforms other open-source T2V models in terms of quality' lacks any supporting numerical results, named baselines, or evaluation protocol (e.g., FVD, CLIP-T scores on a shared test set); §4 must supply these comparisons for the central outperformance assertion to be verifiable.
Authors: We agree that the abstract claim requires explicit support to be verifiable. Section 4 of the original manuscript already reports quantitative results on standard benchmarks (UCF101 and MSR-VTT), including FVD scores and CLIP-T similarity, with direct comparisons to open-source baselines such as ModelScope and CogVideo. To address the referee's concern, we will revise the abstract to briefly cite the key metrics (e.g., lower FVD than baselines) and name the evaluation protocol and test sets. This makes the outperformance assertion self-contained while preserving the existing detailed tables and protocols in §4. revision: yes
-
Referee: [Abstract] Abstract: the assertion that the I2V model is 'the first open-source I2V foundation model' capable of 'strictly' preserving content requires explicit comparison to prior open-source I2V methods and quantitative preservation metrics (e.g., per-frame LPIPS or temporal CLIP similarity to the reference image); without these, the novelty and constraint-satisfaction claims cannot be assessed.
Authors: We acknowledge that the 'first' and 'strictly preserving' claims need quantitative backing and explicit comparisons. The manuscript already demonstrates preservation through qualitative examples and architectural design choices (e.g., image conditioning strength). In the revision, we will add a dedicated subsection in §4 with quantitative preservation metrics, including per-frame LPIPS to the reference image and temporal CLIP similarity across generated frames. We will also include explicit comparisons to prior open-source I2V methods (e.g., any contemporaneous works available at submission time) in a new table. This substantiates the novelty and constraint-satisfaction claims. revision: yes
Circularity Check
No circularity: empirical model claims with no self-referential derivations
full rationale
The paper introduces T2V and I2V diffusion models and asserts their quality and content-preservation properties on the basis of architecture, training, and reported results. No equations, first-principles derivations, or parameter-fitting steps are described that reduce by construction to the inputs or to self-citations. The central claims are empirical assertions about new model capabilities rather than any closed logical loop of the kinds enumerated in the analysis criteria.
Axiom & Free-Parameter Ledger
free parameters (1)
- diffusion model hyperparameters
axioms (1)
- domain assumption Diffusion models can be extended to generate coherent high-resolution videos from text or images
Forward citations
Cited by 25 Pith papers
-
R-DMesh: Video-Guided 3D Animation via Rectified Dynamic Mesh Flow
R-DMesh generates high-fidelity 4D meshes aligned to video by disentangling base mesh, motion, and a learned rectification jump offset inside a VAE, then using Triflow Attention and rectified-flow diffusion.
-
FreeSpec: Training-Free Long Video Generation via Singular-Spectrum Reconstruction
FreeSpec uses SVD-based spectral reconstruction to fuse global low-rank and local high-rank features, reducing content drift and preserving temporal dynamics in long video generation.
-
VAnim: Rendering-Aware Sparse State Modeling for Structure-Preserving Vector Animation
VAnim creates open-domain text-to-SVG animations via sparse state updates on a persistent DOM tree, identification-first planning, and rendering-aware RL with a new 134k-example benchmark.
-
CMTA: Leveraging Cross-Modal Temporal Artifacts for Generalizable AI-Generated Video Detection
CMTA detects AI-generated videos by capturing unnatural temporal stability in visual-textual semantic alignment via joint embeddings and multi-grained temporal modeling, outperforming prior methods in cross-generator tests.
-
Novel View Synthesis as Video Completion
Video diffusion models can be adapted into permutation-invariant generators for sparse novel view synthesis by treating the problem as video completion and removing temporal order cues.
-
OpenVid-1M: A Large-Scale High-Quality Dataset for Text-to-video Generation
OpenVid-1M supplies 1 million high-quality text-video pairs and introduces MVDiT to improve text-to-video generation by better using both visual structure and text semantics.
-
Bridging Brain and Semantics: A Hierarchical Framework for Semantically Enhanced fMRI-to-Video Reconstruction
CineNeuron improves fMRI-to-video reconstruction by combining bottom-up semantic enrichment with top-down Mixture-of-Memories integration and outperforms prior methods on benchmarks.
-
Head Forcing: Long Autoregressive Video Generation via Head Heterogeneity
Head Forcing assigns tailored KV cache strategies to local, anchor, and memory attention heads plus head-wise RoPE re-encoding to extend autoregressive video generation from seconds to minutes without training.
-
Beyond Text Prompts: Visual-to-Visual Generation as A Unified Paradigm
V2V-Zero adapts frozen VLMs for visual conditioning via hidden states from specification pages, scoring 0.85 on GenEval and 32.7 on a new seven-task benchmark while revealing capability hierarchies in attribute bindin...
-
FIS-DiT: Breaking the Few-Step Video Inference Barrier via Training-Free Frame Interleaved Sparsity
FIS-DiT achieves 2.11-2.41x speedup on video DiT models in few-step regimes with negligible quality loss by exploiting frame-wise sparsity and consistency through a training-free interleaved execution strategy.
-
GemDepth: Geometry-Embedded Features for 3D-Consistent Video Depth
GemDepth embeds predicted camera poses into a spatio-temporal transformer to achieve state-of-the-art 3D-consistent video depth estimation.
-
GemDepth: Geometry-Embedded Features for 3D-Consistent Video Depth
GemDepth predicts inter-frame camera poses to inject geometric embeddings into a spatio-temporal transformer, yielding state-of-the-art 3D-consistent video depth.
-
GemDepth: Geometry-Embedded Features for 3D-Consistent Video Depth
GemDepth achieves improved 3D-consistent video depth by embedding predicted inter-frame camera poses into a network with an Alternating Spatio-Temporal Transformer for better spatial precision and temporal coherence.
-
Detecting AI-Generated Videos with Spiking Neural Networks
MAST with spiking neural networks achieves 93.14% mean accuracy detecting AI-generated videos from 10 unseen generators by exploiting smoother pixel residuals and compact semantic trajectories.
-
CineAGI: Character-Consistent Movie Creation through LLM-Orchestrated Multi-Modal Generation and Cross-Scene Integration
CineAGI is a multi-agent LLM framework that generates multi-scene movies with improved character consistency, narrative coherence, and audio-visual alignment.
-
Generative Refinement Networks for Visual Synthesis
GRN uses hierarchical binary quantization and entropy-guided refinement to set new ImageNet records of 0.56 rFID for reconstruction and 1.81 gFID for class-conditional generation while releasing code and models.
-
When Numbers Speak: Aligning Textual Numerals and Visual Instances in Text-to-Video Diffusion Models
NUMINA improves counting accuracy in text-to-video diffusion models by up to 7.4% via a training-free identify-then-guide framework on the new CountBench dataset.
-
ATSS: Detecting AI-Generated Videos via Anomalous Temporal Self-Similarity
ATSS detects AI-generated videos by measuring unnatural repetitive temporal correlations in triple similarity matrices derived from frame visuals and semantic descriptions.
-
VBench-2.0: Advancing Video Generation Benchmark Suite for Intrinsic Faithfulness
VBench-2.0 is a benchmark suite that automatically evaluates video generative models on five dimensions of intrinsic faithfulness: Human Fidelity, Controllability, Creativity, Physics, and Commonsense using VLMs, LLMs...
-
R-DMesh: Video-Guided 3D Animation via Rectified Dynamic Mesh Flow
R-DMesh uses a VAE with a learned rectification jump offset and Triflow Attention inside a rectified-flow diffusion transformer to produce video-aligned 4D meshes despite initial pose misalignment.
-
Diffusion-APO: Trajectory-Aware Direct Preference Alignment for Video Diffusion Transformers
Diffusion-APO synchronizes training noise with inference trajectories in video diffusion models to improve preference alignment and visual quality.
-
Movie Gen: A Cast of Media Foundation Models
A 30B-parameter transformer and related models generate high-quality videos and audio, claiming state-of-the-art results on text-to-video, video editing, personalization, and audio generation tasks.
-
Empowering Video Translation using Multimodal Large Language Models
The paper offers the first focused review of MLLM-based video translation organized by a three-role taxonomy of Semantic Reasoner, Expressive Performer, and Visual Synthesizer, plus open challenges.
-
Show-o2: Improved Native Unified Multimodal Models
Show-o2 unifies text, image, and video understanding and generation in a single autoregressive-plus-flow-matching model built on 3D causal VAE representations.
-
LMMs Meet Object-Centric Vision: Understanding, Segmentation, Editing and Generation
This review organizes literature on large multimodal models and object-centric vision into four themes—understanding, referring segmentation, editing, and generation—while summarizing paradigms, strategies, and challe...
Reference graph
Works this paper leans on
-
[1]
Accessed October 22, 2023 [Online] https:// research.runwayml.com/gen2
Gen-2. Accessed October 22, 2023 [Online] https:// research.runwayml.com/gen2
work page 2023
-
[2]
Accessed October 22, 2023 [Online] https : / / github.com/deep-floyd/IF
If. Accessed October 22, 2023 [Online] https : / / github.com/deep-floyd/IF
work page 2023
-
[3]
Accessed October 22, 2023 [Online] https: //laion.ai/blog/laion-coco/
Laion-coco. Accessed October 22, 2023 [Online] https: //laion.ai/blog/laion-coco/
work page 2023
-
[4]
Accessed October 22, 2023 [Online] https: //github.com/hotshotco/Hotshot-XL
Hotshot-xl. Accessed October 22, 2023 [Online] https: //github.com/hotshotco/Hotshot-XL
work page 2023
-
[5]
Accessed October 22, 2023 [Online] https: //moonvalley.ai/
Moonvalley. Accessed October 22, 2023 [Online] https: //moonvalley.ai/
work page 2023
-
[6]
Accessed October 22, 2023 [Online] https: //www.pika.art/
Pika labs. Accessed October 22, 2023 [Online] https: //www.pika.art/
work page 2023
-
[7]
Accessed October 22, 2023 [Online] https: //huggingface.co/cerspense/zeroscope_v2_ XL
Zeroscope-xl. Accessed October 22, 2023 [Online] https: //huggingface.co/cerspense/zeroscope_v2_ XL
work page 2023
-
[8]
Frozen in time: A joint video and image encoder for end-to-end retrieval
Max Bain, Arsha Nagrani, G ¨ul Varol, and Andrew Zisser- man. Frozen in time: A joint video and image encoder for end-to-end retrieval. In IEEE International Conference on Computer Vision, 2021
work page 2021
-
[9]
eDiff-I: Text-to-Image Diffusion Models with an Ensemble of Expert Denoisers
Yogesh Balaji, Seungjun Nah, Xun Huang, Arash Vahdat, Jiaming Song, Karsten Kreis, Miika Aittala, Timo Aila, Samuli Laine, Bryan Catanzaro, et al. ediffi: Text-to-image diffusion models with an ensemble of expert denoisers.arXiv preprint arXiv:2211.01324, 2022
work page internal anchor Pith review arXiv 2022
-
[10]
Align your latents: High-resolution video synthesis with la- tent diffusion models
Andreas Blattmann, Robin Rombach, Huan Ling, Tim Dock- horn, Seung Wook Kim, Sanja Fidler, and Karsten Kreis. Align your latents: High-resolution video synthesis with la- tent diffusion models. In CVPR, 2023
work page 2023
-
[11]
Muse: Text-to-image generation via masked generative transform- ers
Huiwen Chang, Han Zhang, Jarred Barber, AJ Maschinot, Jose Lezama, Lu Jiang, Ming-Hsuan Yang, Kevin Mur- phy, William T Freeman, Michael Rubinstein, et al. Muse: Text-to-image generation via masked generative transform- ers. arXiv preprint arXiv:2301.00704, 2023
-
[12]
PixArt-$\alpha$: Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis
Junsong Chen, Jincheng Yu, Chongjian Ge, Lewei Yao, Enze Xie, Yue Wu, Zhongdao Wang, James Kwok, Ping Luo, Huchuan Lu, et al. Pixart- α: Fast training of diffusion transformer for photorealistic text-to-image synthesis. arXiv preprint arXiv:2310.00426, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[13]
Dif- fusiondet: Diffusion model for object detection
Shoufa Chen, Peize Sun, Yibing Song, and Ping Luo. Dif- fusiondet: Diffusion model for object detection. In Proceed- ings of the IEEE/CVF International Conference on Com- puter Vision, pages 19830–19843, 2023
work page 2023
-
[14]
Reproducible scal- ing laws for contrastive language-image learning
Mehdi Cherti, Romain Beaumont, Ross Wightman, Mitchell Wortsman, Gabriel Ilharco, Cade Gordon, Christoph Schuh- mann, Ludwig Schmidt, and Jenia Jitsev. Reproducible scal- ing laws for contrastive language-image learning. In Pro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2818–2829, 2023
work page 2023
- [15]
-
[16]
Emu: Enhanc- ing image generation models using photogenic needles in a haystack
Xiaoliang Dai, Ji Hou, Chih-Yao Ma, Sam Tsai, Jialiang Wang, Rui Wang, Peizhao Zhang, Simon Vandenhende, Xi- aofang Wang, Abhimanyu Dubey, et al. Emu: Enhanc- ing image generation models using photogenic needles in a haystack. arXiv preprint arXiv:2309.15807, 2023
-
[17]
An image is worth 16x16 words: Trans- formers for image recognition at scale
Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Syl- vain Gelly, et al. An image is worth 16x16 words: Trans- formers for image recognition at scale. In ICLR, 2020
work page 2020
-
[18]
Structure and content-guided video synthesis with diffusion models
Patrick Esser, Johnathan Chiu, Parmida Atighehchian, Jonathan Granskog, and Anastasis Germanidis. Structure and content-guided video synthesis with diffusion models. In ICCV, 2023
work page 2023
-
[19]
Make-a-scene: Scene- based text-to-image generation with human priors
Oran Gafni, Adam Polyak, Oron Ashual, Shelly Sheynin, Devi Parikh, and Yaniv Taigman. Make-a-scene: Scene- based text-to-image generation with human priors. In Eu- ropean Conference on Computer Vision , pages 89–106. Springer, 2022
work page 2022
-
[20]
Preserve your own correlation: A noise prior for video diffusion models
Songwei Ge, Seungjun Nah, Guilin Liu, Tyler Poon, Andrew Tao, Bryan Catanzaro, David Jacobs, Jia-Bin Huang, Ming- Yu Liu, and Yogesh Balaji. Preserve your own correlation: A noise prior for video diffusion models. In ICCV, 2023
work page 2023
-
[21]
Vec- tor quantized diffusion model for text-to-image synthesis
Shuyang Gu, Dong Chen, Jianmin Bao, Fang Wen, Bo Zhang, Dongdong Chen, Lu Yuan, and Baining Guo. Vec- tor quantized diffusion model for text-to-image synthesis. In CVPR, 2022
work page 2022
-
[22]
Seer: Language Instructed Video Prediction with Latent Diffusion Models
Xianfan Gu, Chuan Wen, Jiaming Song, and Yang Gao. Seer: Language instructed video prediction with latent diffusion models. arXiv preprint arXiv:2303.14897, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[23]
AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning
Yuwei Guo, Ceyuan Yang, Anyi Rao, Yaohui Wang, Yu Qiao, Dahua Lin, and Bo Dai. Animatediff: Animate your personalized text-to-image diffusion models without specific tuning. arXiv preprint arXiv:2307.04725, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[24]
Latent Video Diffusion Models for High-Fidelity Long Video Generation
Yingqing He, Tianyu Yang, Yong Zhang, Ying Shan, and Qifeng Chen. Latent video diffusion models for high-fidelity video generation with arbitrary lengths. arXiv preprint arXiv:2211.13221, 2022
work page internal anchor Pith review arXiv 2022
-
[25]
Scalecrafter: Tuning-free higher- resolution visual generation with diffusion models
Yingqing He, Shaoshu Yang, Haoxin Chen, Xiaodong Cun, Menghan Xia, Yong Zhang, Xintao Wang, Ran He, Qifeng Chen, and Ying Shan. Scalecrafter: Tuning-free higher- resolution visual generation with diffusion models. arXiv preprint arXiv:2310.07702, 2023
-
[26]
Denoising diffu- sion probabilistic models
Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffu- sion probabilistic models. In NeurIPS, 2020
work page 2020
-
[27]
Imagen Video: High Definition Video Generation with Diffusion Models
Jonathan Ho, William Chan, Chitwan Saharia, Jay Whang, Ruiqi Gao, Alexey Gritsenko, Diederik P Kingma, Ben Poole, Mohammad Norouzi, David J Fleet, et al. Imagen video: High definition video generation with diffusion mod- els. arXiv preprint arXiv:2210.02303, 2022
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[28]
Jonathan Ho, Tim Salimans, Alexey Gritsenko, William Chan, Mohammad Norouzi, and David J Fleet. Video dif- fusion models. In NeurIPS, 2022
work page 2022
-
[29]
Levon Khachatryan, Andranik Movsisyan, Vahram Tade- vosyan, Roberto Henschel, Zhangyang Wang, Shant Navasardyan, and Humphrey Shi. Text2video-zero: Text-to- image diffusion models are zero-shot video generators.arXiv preprint arXiv:2303.13439, 2023
-
[30]
Xin Li, Wenqing Chu, Ye Wu, Weihang Yuan, Fanglong Liu, Qi Zhang, Fu Li, Haocheng Feng, Errui Ding, and Jingdong Wang. Videogen: A reference-guided latent diffusion ap- proach for high definition text-to-video generation. arXiv preprint arXiv:2309.00398, 2023. 10
-
[31]
Gligen: Open-set grounded text-to-image generation
Yuheng Li, Haotian Liu, Qingyang Wu, Fangzhou Mu, Jian- wei Yang, Jianfeng Gao, Chunyuan Li, and Yong Jae Lee. Gligen: Open-set grounded text-to-image generation. In CVPR, 2023
work page 2023
-
[32]
Evalcrafter: Benchmarking and eval- uating large video generation models, 2023
Yaofang Liu, Xiaodong Cun, Xuebo Liu, Xintao Wang, Yong Zhang, Haoxin Chen, Yang Liu, Tieyong Zeng, Raymond Chan, and Ying Shan. Evalcrafter: Benchmarking and eval- uating large video generation models, 2023
work page 2023
-
[33]
Videofusion: Decomposed diffusion models for high-quality video generation
Zhengxiong Luo, Dayou Chen, Yingya Zhang, Yan Huang, Liang Wang, Yujun Shen, Deli Zhao, Jingren Zhou, and Tie- niu Tan. Videofusion: Decomposed diffusion models for high-quality video generation. In CVPR, 2023
work page 2023
-
[34]
Follow your pose: Pose-guided text-to-video generation using pose-free videos
Yue Ma, Yingqing He, Xiaodong Cun, Xintao Wang, Ying Shan, Xiu Li, and Qifeng Chen. Follow your pose: Pose-guided text-to-video generation using pose-free videos. arXiv preprint arXiv:2304.01186, 2023
-
[35]
Dreamix: Video diffusion models are general video editors
Eyal Molad, Eliahu Horwitz, Dani Valevski, Alex Rav Acha, Yossi Matias, Yael Pritch, Yaniv Leviathan, and Yedid Hoshen. Dreamix: Video diffusion models are general video editors. arXiv preprint arXiv:2302.01329, 2023
-
[36]
Chong Mou, Xintao Wang, Liangbin Xie, Jian Zhang, Zhon- gang Qi, Ying Shan, and Xiaohu Qie. T2i-adapter: Learning adapters to dig out more controllable ability for text-to-image diffusion models. arXiv preprint arXiv:2302.08453, 2023
-
[37]
Diffusion in the dark: A diffu- sion model for low-light text recognition
Cindy M Nguyen, Eric R Chan, Alexander W Bergman, and Gordon Wetzstein. Diffusion in the dark: A diffu- sion model for low-light text recognition. arXiv preprint arXiv:2303.04291, 2023
-
[38]
Glide: Towards photorealis- tic image generation and editing with text-guided diffusion models
Alexander Quinn Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob Mcgrew, Ilya Sutskever, and Mark Chen. Glide: Towards photorealis- tic image generation and editing with text-guided diffusion models. 2022
work page 2022
-
[39]
SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis
Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas M ¨uller, Joe Penna, and Robin Rombach. Sdxl: Improving latent diffusion mod- els for high-resolution image synthesis. arXiv preprint arXiv:2307.01952, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[40]
Learn- ing transferable visual models from natural language super- vision
Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learn- ing transferable visual models from natural language super- vision. 2021
work page 2021
-
[41]
Hierarchical Text-Conditional Image Generation with CLIP Latents
Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image gen- eration with clip latents. arXiv preprint arXiv:2204.06125, 2022
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[42]
High-resolution image syn- thesis with latent diffusion models
Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Bj¨orn Ommer. High-resolution image syn- thesis with latent diffusion models. In CVPR, 2022
work page 2022
-
[43]
Photorealistic text-to-image diffusion models with deep language understanding
Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, et al. Photorealistic text-to-image diffusion models with deep language understanding. NeurIPS, 2022
work page 2022
-
[44]
Instant- booth: Personalized text-to-image generation without test- time finetuning
Jing Shi, Wei Xiong, Zhe Lin, and Hyun Joon Jung. Instant- booth: Personalized text-to-image generation without test- time finetuning. arXiv preprint arXiv:2304.03411, 2023
-
[45]
Make-a-video: Text-to-video generation without text-video data
Uriel Singer, Adam Polyak, Thomas Hayes, Xi Yin, Jie An, Songyang Zhang, Qiyuan Hu, Harry Yang, Oron Ashual, Oran Gafni, et al. Make-a-video: Text-to-video generation without text-video data. In ICLR, 2023
work page 2023
-
[46]
Deep unsupervised learning using nonequilibrium thermodynamics
Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. 2015
work page 2015
-
[47]
Denoising Diffusion Implicit Models
Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502, 2020
work page internal anchor Pith review Pith/arXiv arXiv 2010
-
[48]
Score-based generative modeling through stochastic differential equa- tions
Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Ab- hishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equa- tions. In ICLR, 2021
work page 2021
-
[49]
Phenaki: Variable length video generation from open domain textual description
Ruben Villegas, Mohammad Babaeizadeh, Pieter-Jan Kin- dermans, Hernan Moraldo, Han Zhang, Mohammad Taghi Saffar, Santiago Castro, Julius Kunze, and Dumitru Erhan. Phenaki: Variable length video generation from open domain textual description. In ICLR, 2023
work page 2023
-
[50]
ModelScope Text-to-Video Technical Report
Jiuniu Wang, Hangjie Yuan, Dayou Chen, Yingya Zhang, Xiang Wang, and Shiwei Zhang. Modelscope text-to-video technical report. arXiv preprint arXiv:2308.06571, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[51]
Videocomposer: Compositional video synthesis with motion controllability
Xiang Wang, Hangjie Yuan, Shiwei Zhang, Dayou Chen, Jiuniu Wang, Yingya Zhang, Yujun Shen, Deli Zhao, and Jingren Zhou. Videocomposer: Compositional video synthesis with motion controllability. arXiv preprint arXiv:2306.02018, 2023
-
[52]
Lavie: High-quality video gener- ation with cascaded latent diffusion models
Yaohui Wang, Xinyuan Chen, Xin Ma, Shangchen Zhou, Ziqi Huang, Yi Wang, Ceyuan Yang, Yinan He, Jiashuo Yu, Peiqing Yang, et al. Lavie: High-quality video gener- ation with cascaded latent diffusion models. arXiv preprint arXiv:2309.15103, 2023
-
[53]
Jinbo Xing, Menghan Xia, Yuxin Liu, Yuechen Zhang, Yong Zhang, Yingqing He, Hanyuan Liu, Haoxin Chen, Xiaodong Cun, Xintao Wang, et al. Make-your-video: Customized video generation using textual and structural guidance.arXiv preprint arXiv:2306.00943, 2023
-
[54]
Ip- adapter: Text compatible image prompt adapter for text-to- image diffusion models
Hu Ye, Jun Zhang, Sibo Liu, Xiao Han, and Wei Yang. Ip- adapter: Text compatible image prompt adapter for text-to- image diffusion models. 2023
work page 2023
-
[55]
IP-Adapter: Text Compatible Image Prompt Adapter for Text-to-Image Diffusion Models
Hu Ye, Jun Zhang, Sibo Liu, Xiao Han, and Wei Yang. Ip- adapter: Text compatible image prompt adapter for text-to- image diffusion models. arXiv preprint arXiv:2308.06721, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[56]
arXiv preprint arXiv:2308.08089 , year=
Shengming Yin, Chenfei Wu, Jian Liang, Jie Shi, Houqiang Li, Gong Ming, and Nan Duan. Dragnuwa: Fine-grained control in video generation by integrating text, image, and trajectory. arXiv preprint arXiv:2308.08089, 2023
-
[57]
Scaling Autoregressive Models for Content-Rich Text-to-Image Generation
Jiahui Yu, Yuanzhong Xu, Jing Yu Koh, Thang Luong, Gun- jan Baid, Zirui Wang, Vijay Vasudevan, Alexander Ku, Yin- fei Yang, Burcu Karagol Ayan, et al. Scaling autoregres- sive models for content-rich text-to-image generation. arXiv preprint arXiv:2206.10789, 2(3):5, 2022
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[58]
Magvit: Masked generative video transformer
Lijun Yu, Yong Cheng, Kihyuk Sohn, Jos ´e Lezama, Han Zhang, Huiwen Chang, Alexander G Hauptmann, Ming- Hsuan Yang, Yuan Hao, Irfan Essa, et al. Magvit: Masked generative video transformer. In CVPR, 2023. 11
work page 2023
-
[59]
Show-1: Marrying pixel and latent diffusion models for text-to-video generation, 2023
David Junhao Zhang, Jay Zhangjie Wu, Jia-Wei Liu, Rui Zhao, Lingmin Ran, Yuchao Gu, Difei Gao, and Mike Zheng Shou. Show-1: Marrying pixel and latent diffusion models for text-to-video generation, 2023
work page 2023
-
[60]
Adding conditional control to text-to-image diffusion models
Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. In ICCV, 2023
work page 2023
-
[61]
Controlvideo: Training-free controllable text-to-video generation
Yabo Zhang, Yuxiang Wei, Dongsheng Jiang, Xiaopeng Zhang, Wangmeng Zuo, and Qi Tian. Controlvideo: Training-free controllable text-to-video generation. arXiv preprint arXiv:2305.13077, 2023
-
[62]
Real- world image variation by aligning diffusion inversion chain
Yuechen Zhang, Jinbo Xing, Eric Lo, and Jiaya Jia. Real- world image variation by aligning diffusion inversion chain. arXiv preprint arXiv:2305.18729, 2023
-
[63]
Daquan Zhou, Weimin Wang, Hanshu Yan, Weiwei Lv, Yizhe Zhu, and Jiashi Feng. Magicvideo: Efficient video generation with latent diffusion models. arXiv preprint arXiv:2211.11018, 2022. 12
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.