arxiv: 2310.19512 · v1 · submitted 2023-10-30 · 💻 cs.CV

Recognition: 2 theorem links

· Lean Theorem

VideoCrafter1: Open Diffusion Models for High-Quality Video Generation

Haoxin Chen , Menghan Xia , Yingqing He , Yong Zhang , Xiaodong Cun , Shaoshu Yang , Jinbo Xing , Yaofang Liu

show 4 more authors

Qifeng Chen Xintao Wang Chao Weng Ying Shan

Authors on Pith no claims yet

Pith reviewed 2026-05-14 21:36 UTC · model grok-4.3

classification 💻 cs.CV

keywords video generationdiffusion modelstext-to-videoimage-to-videoopen-sourcehigh-resolution video

0 comments

The pith

Open diffusion models generate realistic videos at 1024x576 resolution from text, with an image-to-video version that preserves input content.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces two open-source diffusion models for video generation. The text-to-video model creates realistic and cinematic videos from text prompts at 1024 by 576 pixels and outperforms other open-source alternatives. The image-to-video model takes a reference image and produces a video clip that keeps the original content, structure, and style intact, presented as the first such open foundation model. This addresses the scarcity of accessible high-quality video tools for researchers and engineers beyond commercial systems. The work positions these models as contributions to broader community progress in video synthesis.

Core claim

The authors propose text-to-video and image-to-video diffusion models. The T2V model synthesizes realistic and cinematic-quality videos at a resolution of 1024 × 576, outperforming other open-source T2V models. The I2V model is the first open-source I2V foundation model that transforms a given image into a video clip while maintaining strict content preservation constraints on the reference image's content, structure, and style.

What carries the argument

Text-to-video (T2V) and image-to-video (I2V) diffusion models that use conditioning on text inputs for synthesis and on image inputs for content preservation.

Load-bearing premise

The models achieve the claimed levels of realism, cinematic quality, outperformance, and strict content preservation in generated videos.

What would settle it

An independent side-by-side evaluation or user study where the outputs do not match or exceed the quality of other open-source models or where I2V videos visibly alter the input image's structure or style.

read the original abstract

Video generation has increasingly gained interest in both academia and industry. Although commercial tools can generate plausible videos, there is a limited number of open-source models available for researchers and engineers. In this work, we introduce two diffusion models for high-quality video generation, namely text-to-video (T2V) and image-to-video (I2V) models. T2V models synthesize a video based on a given text input, while I2V models incorporate an additional image input. Our proposed T2V model can generate realistic and cinematic-quality videos with a resolution of $1024 \times 576$, outperforming other open-source T2V models in terms of quality. The I2V model is designed to produce videos that strictly adhere to the content of the provided reference image, preserving its content, structure, and style. This model is the first open-source I2V foundation model capable of transforming a given image into a video clip while maintaining content preservation constraints. We believe that these open-source video generation models will contribute significantly to the technological advancements within the community.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

VideoCrafter1 releases open T2V and I2V diffusion models at 1024x576 but the outperformance and preservation claims lack any numbers or baselines in the abstract.

read the letter

The main point on this paper is that it puts out open diffusion models for text-to-video and image-to-video generation, with the T2V version claiming cinematic quality at 1024 by 576 and the I2V version positioned as the first open one that keeps the input image's content, structure, and style fixed across frames. Releasing these as open source is the concrete step forward here, since most strong video generators stay closed and hard to build on. If the code and weights are actually shared, that gives the community usable starting points for further work on video diffusion. The I2V focus on explicit preservation constraints also fills a gap that earlier open efforts did not stress as directly. The soft spot is straightforward: the abstract states outperformance over other open T2V models and strict content adherence without showing any FVD scores, CLIP-T numbers, LPIPS preservation metrics, or named baseline comparisons. Those claims rest on the models existing rather than measured results, so it is difficult to judge how large the advance actually is. If the full paper supplies a clear evaluation protocol and quantitative tables, that would change the picture; right now the evidence is thin. This work is aimed at researchers and engineers who need accessible video generation baselines to experiment with or extend. A reader looking for open implementations to test on their own data would get practical value from it, especially once the results section is checked. I would send it to peer review. The open release and the I2V extension are worth a referee's time to verify the experiments and see whether the quality claims hold up under scrutiny.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces VideoCrafter1, consisting of a text-to-video (T2V) diffusion model that generates realistic 1024×576 videos from text prompts and claims to outperform prior open-source T2V models, together with an image-to-video (I2V) diffusion model that converts a reference image into a video clip while strictly preserving content, structure, and style; the I2V component is presented as the first open-source foundation model satisfying these preservation constraints.

Significance. If the performance and preservation claims are backed by rigorous quantitative evaluation, the work would supply accessible high-resolution open-source video generation models, enabling broader research in video synthesis and related applications.

major comments (2)

[Abstract] Abstract: the claim that the T2V model 'outperforms other open-source T2V models in terms of quality' lacks any supporting numerical results, named baselines, or evaluation protocol (e.g., FVD, CLIP-T scores on a shared test set); §4 must supply these comparisons for the central outperformance assertion to be verifiable.
[Abstract] Abstract: the assertion that the I2V model is 'the first open-source I2V foundation model' capable of 'strictly' preserving content requires explicit comparison to prior open-source I2V methods and quantitative preservation metrics (e.g., per-frame LPIPS or temporal CLIP similarity to the reference image); without these, the novelty and constraint-satisfaction claims cannot be assessed.

minor comments (1)

[Abstract] Ensure consistent use of math mode for resolution notation (1024 × 576) across all sections and figures.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment point-by-point below, providing clarifications and committing to revisions that strengthen the verifiability of our claims without altering the core contributions.

read point-by-point responses

Referee: [Abstract] Abstract: the claim that the T2V model 'outperforms other open-source T2V models in terms of quality' lacks any supporting numerical results, named baselines, or evaluation protocol (e.g., FVD, CLIP-T scores on a shared test set); §4 must supply these comparisons for the central outperformance assertion to be verifiable.

Authors: We agree that the abstract claim requires explicit support to be verifiable. Section 4 of the original manuscript already reports quantitative results on standard benchmarks (UCF101 and MSR-VTT), including FVD scores and CLIP-T similarity, with direct comparisons to open-source baselines such as ModelScope and CogVideo. To address the referee's concern, we will revise the abstract to briefly cite the key metrics (e.g., lower FVD than baselines) and name the evaluation protocol and test sets. This makes the outperformance assertion self-contained while preserving the existing detailed tables and protocols in §4. revision: yes
Referee: [Abstract] Abstract: the assertion that the I2V model is 'the first open-source I2V foundation model' capable of 'strictly' preserving content requires explicit comparison to prior open-source I2V methods and quantitative preservation metrics (e.g., per-frame LPIPS or temporal CLIP similarity to the reference image); without these, the novelty and constraint-satisfaction claims cannot be assessed.

Authors: We acknowledge that the 'first' and 'strictly preserving' claims need quantitative backing and explicit comparisons. The manuscript already demonstrates preservation through qualitative examples and architectural design choices (e.g., image conditioning strength). In the revision, we will add a dedicated subsection in §4 with quantitative preservation metrics, including per-frame LPIPS to the reference image and temporal CLIP similarity across generated frames. We will also include explicit comparisons to prior open-source I2V methods (e.g., any contemporaneous works available at submission time) in a new table. This substantiates the novelty and constraint-satisfaction claims. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical model claims with no self-referential derivations

full rationale

The paper introduces T2V and I2V diffusion models and asserts their quality and content-preservation properties on the basis of architecture, training, and reported results. No equations, first-principles derivations, or parameter-fitting steps are described that reduce by construction to the inputs or to self-citations. The central claims are empirical assertions about new model capabilities rather than any closed logical loop of the kinds enumerated in the analysis criteria.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

Abstract-only review provides insufficient detail on training; standard diffusion assumptions and many unspecified hyperparameters are inferred but not enumerated.

free parameters (1)

diffusion model hyperparameters
Typical training parameters such as noise schedules and learning rates are required but unspecified in abstract.

axioms (1)

domain assumption Diffusion models can be extended to generate coherent high-resolution videos from text or images
Core premise enabling the T2V and I2V approaches.

pith-pipeline@v0.9.0 · 5525 in / 1195 out tokens · 44058 ms · 2026-05-14T21:36:24.597386+00:00 · methodology

discussion (0)

Forward citations

Cited by 25 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

R-DMesh: Video-Guided 3D Animation via Rectified Dynamic Mesh Flow
cs.CV 2026-05 unverdicted novelty 7.0

R-DMesh generates high-fidelity 4D meshes aligned to video by disentangling base mesh, motion, and a learned rectification jump offset inside a VAE, then using Triflow Attention and rectified-flow diffusion.
FreeSpec: Training-Free Long Video Generation via Singular-Spectrum Reconstruction
cs.CV 2026-05 unverdicted novelty 7.0

FreeSpec uses SVD-based spectral reconstruction to fuse global low-rank and local high-rank features, reducing content drift and preserving temporal dynamics in long video generation.
VAnim: Rendering-Aware Sparse State Modeling for Structure-Preserving Vector Animation
cs.CV 2026-05 unverdicted novelty 7.0

VAnim creates open-domain text-to-SVG animations via sparse state updates on a persistent DOM tree, identification-first planning, and rendering-aware RL with a new 134k-example benchmark.
CMTA: Leveraging Cross-Modal Temporal Artifacts for Generalizable AI-Generated Video Detection
cs.CV 2026-05 unverdicted novelty 7.0

CMTA detects AI-generated videos by capturing unnatural temporal stability in visual-textual semantic alignment via joint embeddings and multi-grained temporal modeling, outperforming prior methods in cross-generator tests.
Novel View Synthesis as Video Completion
cs.CV 2026-04 unverdicted novelty 7.0

Video diffusion models can be adapted into permutation-invariant generators for sparse novel view synthesis by treating the problem as video completion and removing temporal order cues.
OpenVid-1M: A Large-Scale High-Quality Dataset for Text-to-video Generation
cs.CV 2024-07 unverdicted novelty 7.0

OpenVid-1M supplies 1 million high-quality text-video pairs and introduces MVDiT to improve text-to-video generation by better using both visual structure and text semantics.
Bridging Brain and Semantics: A Hierarchical Framework for Semantically Enhanced fMRI-to-Video Reconstruction
cs.CV 2026-05 unverdicted novelty 6.0

CineNeuron improves fMRI-to-video reconstruction by combining bottom-up semantic enrichment with top-down Mixture-of-Memories integration and outperforms prior methods on benchmarks.
Head Forcing: Long Autoregressive Video Generation via Head Heterogeneity
cs.CV 2026-05 unverdicted novelty 6.0

Head Forcing assigns tailored KV cache strategies to local, anchor, and memory attention heads plus head-wise RoPE re-encoding to extend autoregressive video generation from seconds to minutes without training.
Beyond Text Prompts: Visual-to-Visual Generation as A Unified Paradigm
cs.CV 2026-05 unverdicted novelty 6.0

V2V-Zero adapts frozen VLMs for visual conditioning via hidden states from specification pages, scoring 0.85 on GenEval and 32.7 on a new seven-task benchmark while revealing capability hierarchies in attribute bindin...
FIS-DiT: Breaking the Few-Step Video Inference Barrier via Training-Free Frame Interleaved Sparsity
cs.CV 2026-05 unverdicted novelty 6.0

FIS-DiT achieves 2.11-2.41x speedup on video DiT models in few-step regimes with negligible quality loss by exploiting frame-wise sparsity and consistency through a training-free interleaved execution strategy.
GemDepth: Geometry-Embedded Features for 3D-Consistent Video Depth
cs.CV 2026-05 unverdicted novelty 6.0

GemDepth embeds predicted camera poses into a spatio-temporal transformer to achieve state-of-the-art 3D-consistent video depth estimation.
GemDepth: Geometry-Embedded Features for 3D-Consistent Video Depth
cs.CV 2026-05 unverdicted novelty 6.0

GemDepth predicts inter-frame camera poses to inject geometric embeddings into a spatio-temporal transformer, yielding state-of-the-art 3D-consistent video depth.
GemDepth: Geometry-Embedded Features for 3D-Consistent Video Depth
cs.CV 2026-05 unverdicted novelty 6.0

GemDepth achieves improved 3D-consistent video depth by embedding predicted inter-frame camera poses into a network with an Alternating Spatio-Temporal Transformer for better spatial precision and temporal coherence.
Detecting AI-Generated Videos with Spiking Neural Networks
cs.CV 2026-05 unverdicted novelty 6.0

MAST with spiking neural networks achieves 93.14% mean accuracy detecting AI-generated videos from 10 unseen generators by exploiting smoother pixel residuals and compact semantic trajectories.
CineAGI: Character-Consistent Movie Creation through LLM-Orchestrated Multi-Modal Generation and Cross-Scene Integration
cs.MM 2026-04 unverdicted novelty 6.0

CineAGI is a multi-agent LLM framework that generates multi-scene movies with improved character consistency, narrative coherence, and audio-visual alignment.
Generative Refinement Networks for Visual Synthesis
cs.CV 2026-04 unverdicted novelty 6.0

GRN uses hierarchical binary quantization and entropy-guided refinement to set new ImageNet records of 0.56 rFID for reconstruction and 1.81 gFID for class-conditional generation while releasing code and models.
When Numbers Speak: Aligning Textual Numerals and Visual Instances in Text-to-Video Diffusion Models
cs.CV 2026-04 unverdicted novelty 6.0

NUMINA improves counting accuracy in text-to-video diffusion models by up to 7.4% via a training-free identify-then-guide framework on the new CountBench dataset.
ATSS: Detecting AI-Generated Videos via Anomalous Temporal Self-Similarity
cs.CV 2026-04 unverdicted novelty 6.0

ATSS detects AI-generated videos by measuring unnatural repetitive temporal correlations in triple similarity matrices derived from frame visuals and semantic descriptions.
VBench-2.0: Advancing Video Generation Benchmark Suite for Intrinsic Faithfulness
cs.CV 2025-03 accept novelty 6.0

VBench-2.0 is a benchmark suite that automatically evaluates video generative models on five dimensions of intrinsic faithfulness: Human Fidelity, Controllability, Creativity, Physics, and Commonsense using VLMs, LLMs...
R-DMesh: Video-Guided 3D Animation via Rectified Dynamic Mesh Flow
cs.CV 2026-05 unverdicted novelty 5.0

R-DMesh uses a VAE with a learned rectification jump offset and Triflow Attention inside a rectified-flow diffusion transformer to produce video-aligned 4D meshes despite initial pose misalignment.
Diffusion-APO: Trajectory-Aware Direct Preference Alignment for Video Diffusion Transformers
cs.CV 2026-05 unverdicted novelty 5.0

Diffusion-APO synchronizes training noise with inference trajectories in video diffusion models to improve preference alignment and visual quality.
Movie Gen: A Cast of Media Foundation Models
cs.CV 2024-10 unverdicted novelty 5.0

A 30B-parameter transformer and related models generate high-quality videos and audio, claiming state-of-the-art results on text-to-video, video editing, personalization, and audio generation tasks.
Empowering Video Translation using Multimodal Large Language Models
cs.CV 2026-04 unverdicted novelty 4.0

The paper offers the first focused review of MLLM-based video translation organized by a three-role taxonomy of Semantic Reasoner, Expressive Performer, and Visual Synthesizer, plus open challenges.
Show-o2: Improved Native Unified Multimodal Models
cs.CV 2025-06 unverdicted novelty 4.0

Show-o2 unifies text, image, and video understanding and generation in a single autoregressive-plus-flow-matching model built on 3D causal VAE representations.
LMMs Meet Object-Centric Vision: Understanding, Segmentation, Editing and Generation
cs.CV 2026-04 unverdicted novelty 3.0

This review organizes literature on large multimodal models and object-centric vision into four themes—understanding, referring segmentation, editing, and generation—while summarizing paradigms, strategies, and challe...

Reference graph

Works this paper leans on

63 extracted references · 63 canonical work pages · cited by 22 Pith papers · 12 internal anchors

[1]

Accessed October 22, 2023 [Online] https:// research.runwayml.com/gen2

Gen-2. Accessed October 22, 2023 [Online] https:// research.runwayml.com/gen2

work page 2023
[2]

Accessed October 22, 2023 [Online] https : / / github.com/deep-floyd/IF

If. Accessed October 22, 2023 [Online] https : / / github.com/deep-floyd/IF

work page 2023
[3]

Accessed October 22, 2023 [Online] https: //laion.ai/blog/laion-coco/

Laion-coco. Accessed October 22, 2023 [Online] https: //laion.ai/blog/laion-coco/

work page 2023
[4]

Accessed October 22, 2023 [Online] https: //github.com/hotshotco/Hotshot-XL

Hotshot-xl. Accessed October 22, 2023 [Online] https: //github.com/hotshotco/Hotshot-XL

work page 2023
[5]

Accessed October 22, 2023 [Online] https: //moonvalley.ai/

Moonvalley. Accessed October 22, 2023 [Online] https: //moonvalley.ai/

work page 2023
[6]

Accessed October 22, 2023 [Online] https: //www.pika.art/

Pika labs. Accessed October 22, 2023 [Online] https: //www.pika.art/

work page 2023
[7]

Accessed October 22, 2023 [Online] https: //huggingface.co/cerspense/zeroscope_v2_ XL

Zeroscope-xl. Accessed October 22, 2023 [Online] https: //huggingface.co/cerspense/zeroscope_v2_ XL

work page 2023
[8]

Frozen in time: A joint video and image encoder for end-to-end retrieval

Max Bain, Arsha Nagrani, G ¨ul Varol, and Andrew Zisser- man. Frozen in time: A joint video and image encoder for end-to-end retrieval. In IEEE International Conference on Computer Vision, 2021

work page 2021
[9]

eDiff-I: Text-to-Image Diffusion Models with an Ensemble of Expert Denoisers

Yogesh Balaji, Seungjun Nah, Xun Huang, Arash Vahdat, Jiaming Song, Karsten Kreis, Miika Aittala, Timo Aila, Samuli Laine, Bryan Catanzaro, et al. ediffi: Text-to-image diffusion models with an ensemble of expert denoisers.arXiv preprint arXiv:2211.01324, 2022

work page internal anchor Pith review arXiv 2022
[10]

Align your latents: High-resolution video synthesis with la- tent diffusion models

Andreas Blattmann, Robin Rombach, Huan Ling, Tim Dock- horn, Seung Wook Kim, Sanja Fidler, and Karsten Kreis. Align your latents: High-resolution video synthesis with la- tent diffusion models. In CVPR, 2023

work page 2023
[11]

Muse: Text-to-image generation via masked generative transform- ers

Huiwen Chang, Han Zhang, Jarred Barber, AJ Maschinot, Jose Lezama, Lu Jiang, Ming-Hsuan Yang, Kevin Mur- phy, William T Freeman, Michael Rubinstein, et al. Muse: Text-to-image generation via masked generative transform- ers. arXiv preprint arXiv:2301.00704, 2023

work page arXiv 2023
[12]

PixArt-$\alpha$: Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis

Junsong Chen, Jincheng Yu, Chongjian Ge, Lewei Yao, Enze Xie, Yue Wu, Zhongdao Wang, James Kwok, Ping Luo, Huchuan Lu, et al. Pixart- α: Fast training of diffusion transformer for photorealistic text-to-image synthesis. arXiv preprint arXiv:2310.00426, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[13]

Dif- fusiondet: Diffusion model for object detection

Shoufa Chen, Peize Sun, Yibing Song, and Ping Luo. Dif- fusiondet: Diffusion model for object detection. In Proceed- ings of the IEEE/CVF International Conference on Com- puter Vision, pages 19830–19843, 2023

work page 2023
[14]

Reproducible scal- ing laws for contrastive language-image learning

Mehdi Cherti, Romain Beaumont, Ross Wightman, Mitchell Wortsman, Gabriel Ilharco, Cade Gordon, Christoph Schuh- mann, Ludwig Schmidt, and Jenia Jitsev. Reproducible scal- ing laws for contrastive language-image learning. In Pro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2818–2829, 2023

work page 2023
[15]

I2vgen-xl

I2VGen-XL contributors. I2vgen-xl. ModelScope. Accessed October 15, 2023 [Online] https://modelscope.cn/ models/damo/Image-to-Video/summary

work page 2023
[16]

Emu: Enhanc- ing image generation models using photogenic needles in a haystack

Xiaoliang Dai, Ji Hou, Chih-Yao Ma, Sam Tsai, Jialiang Wang, Rui Wang, Peizhao Zhang, Simon Vandenhende, Xi- aofang Wang, Abhimanyu Dubey, et al. Emu: Enhanc- ing image generation models using photogenic needles in a haystack. arXiv preprint arXiv:2309.15807, 2023

work page arXiv 2023
[17]

An image is worth 16x16 words: Trans- formers for image recognition at scale

Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Syl- vain Gelly, et al. An image is worth 16x16 words: Trans- formers for image recognition at scale. In ICLR, 2020

work page 2020
[18]

Structure and content-guided video synthesis with diffusion models

Patrick Esser, Johnathan Chiu, Parmida Atighehchian, Jonathan Granskog, and Anastasis Germanidis. Structure and content-guided video synthesis with diffusion models. In ICCV, 2023

work page 2023
[19]

Make-a-scene: Scene- based text-to-image generation with human priors

Oran Gafni, Adam Polyak, Oron Ashual, Shelly Sheynin, Devi Parikh, and Yaniv Taigman. Make-a-scene: Scene- based text-to-image generation with human priors. In Eu- ropean Conference on Computer Vision , pages 89–106. Springer, 2022

work page 2022
[20]

Preserve your own correlation: A noise prior for video diffusion models

Songwei Ge, Seungjun Nah, Guilin Liu, Tyler Poon, Andrew Tao, Bryan Catanzaro, David Jacobs, Jia-Bin Huang, Ming- Yu Liu, and Yogesh Balaji. Preserve your own correlation: A noise prior for video diffusion models. In ICCV, 2023

work page 2023
[21]

Vec- tor quantized diffusion model for text-to-image synthesis

Shuyang Gu, Dong Chen, Jianmin Bao, Fang Wen, Bo Zhang, Dongdong Chen, Lu Yuan, and Baining Guo. Vec- tor quantized diffusion model for text-to-image synthesis. In CVPR, 2022

work page 2022
[22]

Seer: Language Instructed Video Prediction with Latent Diffusion Models

Xianfan Gu, Chuan Wen, Jiaming Song, and Yang Gao. Seer: Language instructed video prediction with latent diffusion models. arXiv preprint arXiv:2303.14897, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[23]

AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning

Yuwei Guo, Ceyuan Yang, Anyi Rao, Yaohui Wang, Yu Qiao, Dahua Lin, and Bo Dai. Animatediff: Animate your personalized text-to-image diffusion models without specific tuning. arXiv preprint arXiv:2307.04725, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[24]

Latent Video Diffusion Models for High-Fidelity Long Video Generation

Yingqing He, Tianyu Yang, Yong Zhang, Ying Shan, and Qifeng Chen. Latent video diffusion models for high-fidelity video generation with arbitrary lengths. arXiv preprint arXiv:2211.13221, 2022

work page internal anchor Pith review arXiv 2022
[25]

Scalecrafter: Tuning-free higher- resolution visual generation with diffusion models

Yingqing He, Shaoshu Yang, Haoxin Chen, Xiaodong Cun, Menghan Xia, Yong Zhang, Xintao Wang, Ran He, Qifeng Chen, and Ying Shan. Scalecrafter: Tuning-free higher- resolution visual generation with diffusion models. arXiv preprint arXiv:2310.07702, 2023

work page arXiv 2023
[26]

Denoising diffu- sion probabilistic models

Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffu- sion probabilistic models. In NeurIPS, 2020

work page 2020
[27]

Imagen Video: High Definition Video Generation with Diffusion Models

Jonathan Ho, William Chan, Chitwan Saharia, Jay Whang, Ruiqi Gao, Alexey Gritsenko, Diederik P Kingma, Ben Poole, Mohammad Norouzi, David J Fleet, et al. Imagen video: High definition video generation with diffusion mod- els. arXiv preprint arXiv:2210.02303, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[28]

Video dif- fusion models

Jonathan Ho, Tim Salimans, Alexey Gritsenko, William Chan, Mohammad Norouzi, and David J Fleet. Video dif- fusion models. In NeurIPS, 2022

work page 2022
[29]

Text2video-zero: Text-to- image diffusion models are zero-shot video generators.arXiv preprint arXiv:2303.13439, 2023

Levon Khachatryan, Andranik Movsisyan, Vahram Tade- vosyan, Roberto Henschel, Zhangyang Wang, Shant Navasardyan, and Humphrey Shi. Text2video-zero: Text-to- image diffusion models are zero-shot video generators.arXiv preprint arXiv:2303.13439, 2023

work page arXiv 2023
[30]

Videogen: A reference-guided latent diffusion ap- proach for high definition text-to-video generation

Xin Li, Wenqing Chu, Ye Wu, Weihang Yuan, Fanglong Liu, Qi Zhang, Fu Li, Haocheng Feng, Errui Ding, and Jingdong Wang. Videogen: A reference-guided latent diffusion ap- proach for high definition text-to-video generation. arXiv preprint arXiv:2309.00398, 2023. 10

work page arXiv 2023
[31]

Gligen: Open-set grounded text-to-image generation

Yuheng Li, Haotian Liu, Qingyang Wu, Fangzhou Mu, Jian- wei Yang, Jianfeng Gao, Chunyuan Li, and Yong Jae Lee. Gligen: Open-set grounded text-to-image generation. In CVPR, 2023

work page 2023
[32]

Evalcrafter: Benchmarking and eval- uating large video generation models, 2023

Yaofang Liu, Xiaodong Cun, Xuebo Liu, Xintao Wang, Yong Zhang, Haoxin Chen, Yang Liu, Tieyong Zeng, Raymond Chan, and Ying Shan. Evalcrafter: Benchmarking and eval- uating large video generation models, 2023

work page 2023
[33]

Videofusion: Decomposed diffusion models for high-quality video generation

Zhengxiong Luo, Dayou Chen, Yingya Zhang, Yan Huang, Liang Wang, Yujun Shen, Deli Zhao, Jingren Zhou, and Tie- niu Tan. Videofusion: Decomposed diffusion models for high-quality video generation. In CVPR, 2023

work page 2023
[34]

Follow your pose: Pose-guided text-to-video generation using pose-free videos

Yue Ma, Yingqing He, Xiaodong Cun, Xintao Wang, Ying Shan, Xiu Li, and Qifeng Chen. Follow your pose: Pose-guided text-to-video generation using pose-free videos. arXiv preprint arXiv:2304.01186, 2023

work page arXiv 2023
[35]

Dreamix: Video diffusion models are general video editors

Eyal Molad, Eliahu Horwitz, Dani Valevski, Alex Rav Acha, Yossi Matias, Yael Pritch, Yaniv Leviathan, and Yedid Hoshen. Dreamix: Video diffusion models are general video editors. arXiv preprint arXiv:2302.01329, 2023

work page arXiv 2023
[36]

T2i-adapter: Learning adapters to dig out more controllable ability for text-to-image diffusion models

Chong Mou, Xintao Wang, Liangbin Xie, Jian Zhang, Zhon- gang Qi, Ying Shan, and Xiaohu Qie. T2i-adapter: Learning adapters to dig out more controllable ability for text-to-image diffusion models. arXiv preprint arXiv:2302.08453, 2023

work page arXiv 2023
[37]

Diffusion in the dark: A diffu- sion model for low-light text recognition

Cindy M Nguyen, Eric R Chan, Alexander W Bergman, and Gordon Wetzstein. Diffusion in the dark: A diffu- sion model for low-light text recognition. arXiv preprint arXiv:2303.04291, 2023

work page arXiv 2023
[38]

Glide: Towards photorealis- tic image generation and editing with text-guided diffusion models

Alexander Quinn Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob Mcgrew, Ilya Sutskever, and Mark Chen. Glide: Towards photorealis- tic image generation and editing with text-guided diffusion models. 2022

work page 2022
[39]

SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis

Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas M ¨uller, Joe Penna, and Robin Rombach. Sdxl: Improving latent diffusion mod- els for high-resolution image synthesis. arXiv preprint arXiv:2307.01952, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[40]

Learn- ing transferable visual models from natural language super- vision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learn- ing transferable visual models from natural language super- vision. 2021

work page 2021
[41]

Hierarchical Text-Conditional Image Generation with CLIP Latents

Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image gen- eration with clip latents. arXiv preprint arXiv:2204.06125, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[42]

High-resolution image syn- thesis with latent diffusion models

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Bj¨orn Ommer. High-resolution image syn- thesis with latent diffusion models. In CVPR, 2022

work page 2022
[43]

Photorealistic text-to-image diffusion models with deep language understanding

Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, et al. Photorealistic text-to-image diffusion models with deep language understanding. NeurIPS, 2022

work page 2022
[44]

Instant- booth: Personalized text-to-image generation without test- time finetuning

Jing Shi, Wei Xiong, Zhe Lin, and Hyun Joon Jung. Instant- booth: Personalized text-to-image generation without test- time finetuning. arXiv preprint arXiv:2304.03411, 2023

work page arXiv 2023
[45]

Make-a-video: Text-to-video generation without text-video data

Uriel Singer, Adam Polyak, Thomas Hayes, Xi Yin, Jie An, Songyang Zhang, Qiyuan Hu, Harry Yang, Oron Ashual, Oran Gafni, et al. Make-a-video: Text-to-video generation without text-video data. In ICLR, 2023

work page 2023
[46]

Deep unsupervised learning using nonequilibrium thermodynamics

Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. 2015

work page 2015
[47]

Denoising Diffusion Implicit Models

Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2010
[48]

Score-based generative modeling through stochastic differential equa- tions

Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Ab- hishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equa- tions. In ICLR, 2021

work page 2021
[49]

Phenaki: Variable length video generation from open domain textual description

Ruben Villegas, Mohammad Babaeizadeh, Pieter-Jan Kin- dermans, Hernan Moraldo, Han Zhang, Mohammad Taghi Saffar, Santiago Castro, Julius Kunze, and Dumitru Erhan. Phenaki: Variable length video generation from open domain textual description. In ICLR, 2023

work page 2023
[50]

ModelScope Text-to-Video Technical Report

Jiuniu Wang, Hangjie Yuan, Dayou Chen, Yingya Zhang, Xiang Wang, and Shiwei Zhang. Modelscope text-to-video technical report. arXiv preprint arXiv:2308.06571, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[51]

Videocomposer: Compositional video synthesis with motion controllability

Xiang Wang, Hangjie Yuan, Shiwei Zhang, Dayou Chen, Jiuniu Wang, Yingya Zhang, Yujun Shen, Deli Zhao, and Jingren Zhou. Videocomposer: Compositional video synthesis with motion controllability. arXiv preprint arXiv:2306.02018, 2023

work page arXiv 2023
[52]

Lavie: High-quality video gener- ation with cascaded latent diffusion models

Yaohui Wang, Xinyuan Chen, Xin Ma, Shangchen Zhou, Ziqi Huang, Yi Wang, Ceyuan Yang, Yinan He, Jiashuo Yu, Peiqing Yang, et al. Lavie: High-quality video gener- ation with cascaded latent diffusion models. arXiv preprint arXiv:2309.15103, 2023

work page arXiv 2023
[53]

Make-your-video: Customized video generation using textual and structural guidance.arXiv preprint arXiv:2306.00943, 2023

Jinbo Xing, Menghan Xia, Yuxin Liu, Yuechen Zhang, Yong Zhang, Yingqing He, Hanyuan Liu, Haoxin Chen, Xiaodong Cun, Xintao Wang, et al. Make-your-video: Customized video generation using textual and structural guidance.arXiv preprint arXiv:2306.00943, 2023

work page arXiv 2023
[54]

Ip- adapter: Text compatible image prompt adapter for text-to- image diffusion models

Hu Ye, Jun Zhang, Sibo Liu, Xiao Han, and Wei Yang. Ip- adapter: Text compatible image prompt adapter for text-to- image diffusion models. 2023

work page 2023
[55]

IP-Adapter: Text Compatible Image Prompt Adapter for Text-to-Image Diffusion Models

Hu Ye, Jun Zhang, Sibo Liu, Xiao Han, and Wei Yang. Ip- adapter: Text compatible image prompt adapter for text-to- image diffusion models. arXiv preprint arXiv:2308.06721, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[56]

arXiv preprint arXiv:2308.08089 , year=

Shengming Yin, Chenfei Wu, Jian Liang, Jie Shi, Houqiang Li, Gong Ming, and Nan Duan. Dragnuwa: Fine-grained control in video generation by integrating text, image, and trajectory. arXiv preprint arXiv:2308.08089, 2023

work page arXiv 2023
[57]

Scaling Autoregressive Models for Content-Rich Text-to-Image Generation

Jiahui Yu, Yuanzhong Xu, Jing Yu Koh, Thang Luong, Gun- jan Baid, Zirui Wang, Vijay Vasudevan, Alexander Ku, Yin- fei Yang, Burcu Karagol Ayan, et al. Scaling autoregres- sive models for content-rich text-to-image generation. arXiv preprint arXiv:2206.10789, 2(3):5, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[58]

Magvit: Masked generative video transformer

Lijun Yu, Yong Cheng, Kihyuk Sohn, Jos ´e Lezama, Han Zhang, Huiwen Chang, Alexander G Hauptmann, Ming- Hsuan Yang, Yuan Hao, Irfan Essa, et al. Magvit: Masked generative video transformer. In CVPR, 2023. 11

work page 2023
[59]

Show-1: Marrying pixel and latent diffusion models for text-to-video generation, 2023

David Junhao Zhang, Jay Zhangjie Wu, Jia-Wei Liu, Rui Zhao, Lingmin Ran, Yuchao Gu, Difei Gao, and Mike Zheng Shou. Show-1: Marrying pixel and latent diffusion models for text-to-video generation, 2023

work page 2023
[60]

Adding conditional control to text-to-image diffusion models

Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. In ICCV, 2023

work page 2023
[61]

Controlvideo: Training-free controllable text-to-video generation

Yabo Zhang, Yuxiang Wei, Dongsheng Jiang, Xiaopeng Zhang, Wangmeng Zuo, and Qi Tian. Controlvideo: Training-free controllable text-to-video generation. arXiv preprint arXiv:2305.13077, 2023

work page arXiv 2023
[62]

Real- world image variation by aligning diffusion inversion chain

Yuechen Zhang, Jinbo Xing, Eric Lo, and Jiaya Jia. Real- world image variation by aligning diffusion inversion chain. arXiv preprint arXiv:2305.18729, 2023

work page arXiv 2023
[63]

arXiv:2211.11018 , year=

Daquan Zhou, Weimin Wang, Hanshu Yan, Weiwei Lv, Yizhe Zhu, and Jiashi Feng. Magicvideo: Efficient video generation with latent diffusion models. arXiv preprint arXiv:2211.11018, 2022. 12

work page arXiv 2022