DOLLAR: Few-Step Video Generation via Distillation and Latent Reward Optimization

Chi Jin; Difan Liu; Haitian Zheng; Krishna Kumar Singh; Qiang Zhang; Yan Kang; Yuchen Liu; Zhe Lin; Zihan Ding

arxiv: 2412.15689 · v2 · submitted 2024-12-20 · 💻 cs.CV

DOLLAR: Few-Step Video Generation via Distillation and Latent Reward Optimization

Zihan Ding , Chi Jin , Difan Liu , Haitian Zheng , Krishna Kumar Singh , Qiang Zhang , Yan Kang , Zhe Lin

show 1 more author

Yuchen Liu

This is my paper

Pith reviewed 2026-05-23 06:55 UTC · model grok-4.3

classification 💻 cs.CV

keywords video generationdiffusion modelsdistillationfew-step samplingreward optimizationlatent space fine-tuningsampling acceleration

0 comments

The pith

Distillation of video diffusion models into 1-4 sampling steps preserves quality and diversity while exceeding the original model's benchmark scores.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to show that diffusion-based video generators can be distilled into much faster versions that still produce good results. It combines variational score distillation with consistency distillation to shrink the required sampling steps from around 50 down to 1-4. A follow-on step uses a latent-space reward model to tune the output toward any chosen metric without needing the reward to be differentiable or large extra memory. This matters because full diffusion sampling is too slow for many uses, so a reliable few-step version could make video generation practical for longer clips. The authors report that their student models outperform the teacher on VBench for 128-frame, 10-second videos and receive better human ratings than 50-step outputs.

Core claim

The central claim is that merging variational score distillation and consistency distillation produces student models able to generate 10-second videos in one to four steps. Adding latent reward model fine-tuning further improves results on any specified metric. The resulting models reach a VBench score of 82.57, higher than the teacher and competing systems, while one-step versions run up to 278.6 times faster than the teacher's 50-step process. Human evaluations confirm the 4-step outputs are preferred over the teacher's 50-step results.

What carries the argument

The central mechanism is the pairing of variational score distillation and consistency distillation to compress the diffusion sampling process, extended by latent reward model fine-tuning that works without differentiability of the reward.

If this is right

Video generation at 12 frames per second for 128 frames becomes feasible with far less compute than standard diffusion sampling.
The same distillation pipeline can be applied to improve results on any chosen reward metric through non-differentiable latent-space tuning.
One-step versions deliver near real-time generation while the 4-step versions match or exceed multi-step baselines on standard video quality measures.
Human preference data align with the automatic scores, indicating the speed gain does not come at the cost of visible quality loss.
The method supports state-of-the-art few-step performance on 10-second video clips compared with other published systems.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same distillation steps could be tested on image or audio diffusion models to check whether the step reduction generalizes across modalities.
Because the reward tuning requires no differentiability, it could be used to align outputs with evolving human feedback loops without retraining the full model.
Extending the approach to videos longer than 10 seconds would test whether motion consistency holds when the reduced-step regime is pushed further.
Practical deployment in video editing software becomes more realistic once generation time drops below real-time thresholds.

Load-bearing premise

The assumption that the distillation process keeps perceptual quality and diversity intact when cutting steps from 50 to 1-4 without creating new flaws that the chosen benchmark misses.

What would settle it

A controlled test in which human viewers consistently rate the 50-step teacher videos higher than the 4-step student videos on the same prompts would show the performance claim does not hold.

Figures

Figures reproduced from arXiv: 2412.15689 by Chi Jin, Difan Liu, Haitian Zheng, Krishna Kumar Singh, Qiang Zhang, Yan Kang, Yuchen Liu, Zhe Lin, Zihan Ding.

**Figure 2.** Figure 2: Method Overview: The few-step generator Gθ is trained to generate high-quality samples from random noise in latent space, guided by a combination of variational score distillation (VSD), consistency distillation (CD), and latent reward model (LRM) fine-tuning objectives. VSD loss enhances sample quality, albeit with a risk of mode collapse, while CD loss increases sample diversity without compromising gene… view at source ↗

**Figure 3.** Figure 3: Demonstration of the conjugate velocity prediction: relationship of v-prediction for diffusion and rectified flow. with the sample xt being diffused along the diffusion trajectory according to the schedule defined as Eq. (1). The model is parameterized to predict velocity vt on RF trajectory at each timestep t, with a constant target (x0 −ε) (we take a reverse here as opposed to standard RF for notation … view at source ↗

**Figure 4.** Figure 4: Visualization of samples in training dataset (left) and [PITH_FULL_IMAGE:figures/full_fig_p005_4.png] view at source ↗

**Figure 5.** Figure 5: Comparison of different reward fine-tuning methods: [PITH_FULL_IMAGE:figures/full_fig_p005_5.png] view at source ↗

**Figure 6.** Figure 6: Human evaluation results over four independent metrics: visual quality, text-video alignment, motion and general preference. [PITH_FULL_IMAGE:figures/full_fig_p009_6.png] view at source ↗

**Figure 7.** Figure 7: Compare the generated samples with (first line) and [PITH_FULL_IMAGE:figures/full_fig_p010_7.png] view at source ↗

**Figure 9.** Figure 9: Latent reward model fine-tuning process for dynamic [PITH_FULL_IMAGE:figures/full_fig_p012_9.png] view at source ↗

**Figure 11.** Figure 11: The learning process of LRM with PickScore reward. [PITH_FULL_IMAGE:figures/full_fig_p018_11.png] view at source ↗

**Figure 10.** Figure 10: The learning process of LRM with HPSv2 reward. [PITH_FULL_IMAGE:figures/full_fig_p018_10.png] view at source ↗

**Figure 12.** Figure 12: Reward model fine-tuning process with VSD+DDPO [PITH_FULL_IMAGE:figures/full_fig_p020_12.png] view at source ↗

**Figure 13.** Figure 13: The user interface for human evaluation experiments. [PITH_FULL_IMAGE:figures/full_fig_p021_13.png] view at source ↗

**Figure 14.** Figure 14: Reward model fine-tuning with dynamic degree: (left) [PITH_FULL_IMAGE:figures/full_fig_p024_14.png] view at source ↗

**Figure 15.** Figure 15: More qualitative results of our method (VSD+CD+LRM). Five frames are displayed for each video (frame index: 0, 30, 60, 90, [PITH_FULL_IMAGE:figures/full_fig_p026_15.png] view at source ↗

**Figure 16.** Figure 16: More qualitative results of our method (VSD+CD+LRM). Five frames are displayed for each video (frame index: 0, 30, 60, 90, [PITH_FULL_IMAGE:figures/full_fig_p027_16.png] view at source ↗

**Figure 17.** Figure 17: More qualitative results of our method (VSD+CD+LRM). Five frames are displayed for each video (frame index: 0, 30, 60, 90, [PITH_FULL_IMAGE:figures/full_fig_p028_17.png] view at source ↗

**Figure 18.** Figure 18: Comparison of our method (VSD+CD+LRM) against several baselines. Five frames are displayed for each video (frame index: [PITH_FULL_IMAGE:figures/full_fig_p029_18.png] view at source ↗

**Figure 19.** Figure 19: Comparison of our method (VSD+CD+LRM) against several baselines. Five frames are displayed for each video (frame index: [PITH_FULL_IMAGE:figures/full_fig_p030_19.png] view at source ↗

**Figure 20.** Figure 20: Visualization of video samples using different methods: VSD, VSD+DDPO(PickScore), VSD+DDPO(HPSv2), [PITH_FULL_IMAGE:figures/full_fig_p031_20.png] view at source ↗

**Figure 21.** Figure 21: Visualization of video samples using different methods: VSD, VSD+DDPO(PickScore), VSD+DDPO(HPSv2), [PITH_FULL_IMAGE:figures/full_fig_p032_21.png] view at source ↗

**Figure 22.** Figure 22: VSD 1, 2, 4 steps, teacher with 50 steps. Five frames are displayed for each video (frame index: 0, 30, 60, 90, 120). [PITH_FULL_IMAGE:figures/full_fig_p033_22.png] view at source ↗

**Figure 23.** Figure 23: VSD 1, 2, 4 steps, teacher with 50 steps. Five frames are displayed for each video (frame index: 0, 30, 60, 90, 120). [PITH_FULL_IMAGE:figures/full_fig_p034_23.png] view at source ↗

**Figure 24.** Figure 24: Sample results of 4 steps DDIM by the teacher model and 4 steps student model with CD loss. One frame for each video. [PITH_FULL_IMAGE:figures/full_fig_p035_24.png] view at source ↗

**Figure 25.** Figure 25: Diversity: VSD (top line), VSD+CD (middle line), teacher (bottom line), one frame from each video (5 videos). [PITH_FULL_IMAGE:figures/full_fig_p036_25.png] view at source ↗

**Figure 26.** Figure 26: Diversity: VSD (top line), VSD+CD (middle line), teacher (bottom line), one frame from each video (5 videos). [PITH_FULL_IMAGE:figures/full_fig_p037_26.png] view at source ↗

**Figure 27.** Figure 27: Comparison of sampled videos for VBench long and short prompts. Five frames are displayed for each video (frame index: 0, [PITH_FULL_IMAGE:figures/full_fig_p038_27.png] view at source ↗

**Figure 28.** Figure 28: Video generation with diverse styles, using prompts from VBench. Five frames are extracted uniformly from one video for each [PITH_FULL_IMAGE:figures/full_fig_p039_28.png] view at source ↗

**Figure 29.** Figure 29: Video generation with diverse camera motions, using prompts from VBench. Five frames are extracted uniformly from one [PITH_FULL_IMAGE:figures/full_fig_p040_29.png] view at source ↗

read the original abstract

Diffusion probabilistic models have shown significant progress in video generation; however, their computational efficiency is limited by the large number of sampling steps required. Reducing sampling steps often compromises video quality or generation diversity. In this work, we introduce a distillation method that combines variational score distillation and consistency distillation to achieve few-step video generation, maintaining both high quality and diversity. We also propose a latent reward model fine-tuning approach to further enhance video generation performance according to any specified reward metric. This approach reduces memory usage and does not require the reward to be differentiable. Our method demonstrates state-of-the-art performance in few-step generation for 10-second videos (128 frames at 12 FPS). The distilled student model achieves a score of 82.57 on VBench, surpassing the teacher model as well as baseline models Gen-3, T2V-Turbo, and Kling. One-step distillation accelerates the teacher model's diffusion sampling by up to 278.6 times, enabling near real-time generation. Human evaluations further validate the superior performance of our 4-step student models compared to teacher model using 50-step DDIM sampling.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

DOLLAR pairs variational score distillation with consistency distillation plus latent reward tuning to reach 1-4 step video generation that beats the teacher on VBench.

read the letter

The main takeaway is that this method trains a student video model by combining variational score distillation and consistency distillation, then applies latent-space reward fine-tuning that works with any metric and avoids differentiability or high memory costs. The result is claimed 82.57 on VBench for 10-second clips, beating the 50-step teacher and baselines like Gen-3, T2V-Turbo, and Kling, with up to 278x speedup and supporting human preference data for the 4-step version.

Referee Report

1 major / 0 minor

Summary. The paper proposes DOLLAR, a distillation framework that combines variational score distillation with consistency distillation to reduce diffusion sampling steps for video generation while aiming to preserve quality and diversity. It further introduces a latent reward model fine-tuning procedure that optimizes according to arbitrary (possibly non-differentiable) reward metrics with reduced memory cost. The central empirical claims are that the resulting student model reaches 82.57 on VBench (surpassing the teacher and baselines Gen-3, T2V-Turbo, Kling), that one-step distillation yields up to 278.6× acceleration, and that 4-step student models are preferred over 50-step DDIM teacher sampling in human evaluations for 10-second (128-frame) videos.

Significance. If the reported gains and human-preference results hold under rigorous controls, the work would constitute a meaningful practical advance in few-step video synthesis, directly addressing the sampling-cost barrier that currently limits deployment of high-quality diffusion video models. The latent-reward fine-tuning component, if shown to be stable and general, would also be of independent interest for reward-driven alignment without differentiability constraints.

major comments (1)

[Abstract] Abstract: the central performance numbers (VBench 82.57, 278.6× speedup, human preference over 50-step DDIM) are presented without any accompanying ablation tables, variance estimates, or protocol details on how step counts and reward metrics were selected post-hoc. Because these numbers are load-bearing for the claim of “state-of-the-art few-step generation,” the absence of such controls prevents assessment of whether the distillation combination truly preserves diversity and avoids artifacts that VBench may miss.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for highlighting the need for greater transparency around the headline numbers in the abstract. The full manuscript contains the requested ablations, variance reporting, and protocol details in Sections 4 and 5; we will revise the abstract to reference these controls explicitly and add a short statement on metric and step selection.

read point-by-point responses

Referee: [Abstract] Abstract: the central performance numbers (VBench 82.57, 278.6× speedup, human preference over 50-step DDIM) are presented without any accompanying ablation tables, variance estimates, or protocol details on how step counts and reward metrics were selected post-hoc. Because these numbers are load-bearing for the claim of “state-of-the-art few-step generation,” the absence of such controls prevents assessment of whether the distillation combination truly preserves diversity and avoids artifacts that VBench may miss.

Authors: We agree the abstract is too terse. The manuscript already reports (i) ablation tables over 1/2/4/8-step students and multiple reward combinations (Section 4.2–4.3), (ii) standard deviations across three random seeds for VBench and human preference scores (Table 2 and Appendix C), and (iii) explicit protocol: step counts follow the common few-step regime used by prior work (T2V-Turbo, InstaFlow); reward metrics were chosen a priori to span the VBench axes plus CLIP aesthetic and motion smoothness, with the latent reward model trained once on the union. Human studies (Section 5.3) directly compare 4-step DOLLAR against 50-step DDIM on the same prompts and show preference for diversity and artifact reduction. We will expand the abstract by one sentence referencing these controls and will move the post-hoc selection concern into a dedicated paragraph in Section 4.1. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The provided abstract and context contain no equations, derivation steps, or self-citations that reduce performance claims to fitted inputs by construction. Claims rest on external benchmarks (VBench) and comparisons to independent models (Gen-3, T2V-Turbo, Kling), with no evidence of self-definitional loops, fitted inputs renamed as predictions, or load-bearing self-citations. The distillation approach is described at a high level without internal reductions that would trigger circularity patterns. This is the expected outcome for a methods paper whose central results are externally validated rather than self-referential.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no information on free parameters, axioms, or invented entities.

pith-pipeline@v0.9.0 · 5745 in / 1040 out tokens · 23913 ms · 2026-05-23T06:55:08.108487+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

distillation method that combines variational score distillation and consistency distillation... latent reward model fine-tuning approach
IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

one-step distillation accelerates... 278.6 times

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Efficient Video Diffusion Models: Advancements and Challenges
cs.CV 2026-04 unverdicted novelty 7.0

A survey that groups efficient video diffusion methods into four paradigms—step distillation, efficient attention, model compression, and cache/trajectory optimization—and outlines open challenges for practical use.
SURF: Signature-Retained Fast Video Generation
cs.GR 2025-11 unverdicted novelty 6.0

SURF accelerates high-resolution video generation up to 12.5x by using noise reshifting for low-res previews from pretrained models and a shifting-window Refiner for efficient upscaling that retains original signatures.
Learning World Models for Interactive Video Generation
cs.CV 2025-05 unverdicted novelty 5.0

The work introduces video retrieval augmented generation (VRAG) with explicit global state conditioning to reduce compounding errors and improve spatiotemporal consistency in interactive video world models.

Reference graph

Works this paper leans on

79 extracted references · 79 canonical work pages · cited by 3 Pith papers · 27 internal anchors

[1]

GPT-4 Technical Report

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ah- mad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774 ,

work page internal anchor Pith review Pith/arXiv arXiv
[2]

Lumiere: A space-time diffusion model for video generation.arXiv preprint arXiv:2401.12945, 2024

Omer Bar-Tal, Hila Chefer, Omer Tov, Charles Her- rmann, Roni Paiss, Shiran Zada, Ariel Ephrat, Junhwa Hur, Guanghui Liu, Amit Raj, et al. Lumiere: A space- time diffusion model for video generation. arXiv preprint arXiv:2401.12945, 2024. 2

work page arXiv 2024
[3]

Training Diffusion Models with Reinforcement Learning

Kevin Black, Michael Janner, Yilun Du, Ilya Kostrikov, and Sergey Levine. Training diffusion models with reinforce- ment learning. arXiv preprint arXiv:2305.13301, 2023. 3, 5, 10, 12, 2

work page internal anchor Pith review Pith/arXiv arXiv 2023
[4]

Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets

Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram V oleti, Adam Letts, et al. Stable video diffusion: Scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127, 2023. 2

work page internal anchor Pith review Pith/arXiv arXiv 2023
[5]

Align your latents: High-resolution video synthesis with la- tent diffusion models

Andreas Blattmann, Robin Rombach, Huan Ling, Tim Dock- horn, Seung Wook Kim, Sanja Fidler, and Karsten Kreis. Align your latents: High-resolution video synthesis with la- tent diffusion models. In Proceedings of the IEEE/CVF Con- ference on Computer Vision and Pattern Recognition, pages 22563–22575, 2023. 2, 7

work page 2023
[6]

Videocrafter2: 12 Overcoming data limitations for high-quality video diffu- sion models

Haoxin Chen, Yong Zhang, Xiaodong Cun, Menghan Xia, Xintao Wang, Chao Weng, and Ying Shan. Videocrafter2: 12 Overcoming data limitations for high-quality video diffu- sion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages 7310– 7320, 2024. 2

work page 2024
[7]

Diffusion Posterior Sampling for General Noisy Inverse Problems

Hyungjin Chung, Jeongsol Kim, Michael T Mccann, Marc L Klasky, and Jong Chul Ye. Diffusion posterior sam- pling for general noisy inverse problems. arXiv preprint arXiv:2209.14687, 2022. 3

work page internal anchor Pith review Pith/arXiv arXiv 2022
[8]

Directly Fine-Tuning Diffusion Models on Differentiable Rewards

Kevin Clark, Paul Vicol, Kevin Swersky, and David J Fleet. Directly fine-tuning diffusion models on differentiable re- wards. arXiv preprint arXiv:2309.17400, 2023. 3, 5, 6, 10, 1

work page internal anchor Pith review Pith/arXiv arXiv 2023
[9]

Adjoint matching: Fine- tuning flow and diffusion generative models with memoryless stochastic optimal control.arXiv preprint arXiv:2409.08861, 2024

Carles Domingo-Enrich, Michal Drozdzal, Brian Karrer, and Ricky TQ Chen. Adjoint matching: Fine-tuning flow and diffusion generative models with memoryless stochastic op- timal control. arXiv preprint arXiv:2409.08861, 2024. 3, 4, 9

work page arXiv 2024
[10]

Structure- aware video generation with latent diffusion models

Patrick Esser, Robin Rombach, and Bj¨orn Ommer. Structure- aware video generation with latent diffusion models. arXiv preprint arXiv:2303.07332, 2023. 1, 2, 8

work page arXiv 2023
[11]

The vendi score: A diversity evaluation metric for machine learning

Dan Friedman and Adji Bousso Dieng. The vendi score: A diversity evaluation metric for machine learning. arXiv preprint arXiv:2210.02410, 2022. 9

work page arXiv 2022
[12]

Generative adversarial nets

Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. Advances in neural information processing systems, 27, 2014. 2

work page 2014
[13]

Flexible diffusion modeling of long videos.arXiv preprint arXiv:2205.11495, 2022

William Harvey, Søren Nørskov, Niklas K ¨olch, and George V ogiatzis. Flexible diffusion modeling of long videos.arXiv preprint arXiv:2205.11495, 2022. 2

work page arXiv 2022
[14]

Videoscore: Building automatic metrics to simulate fine-grained human feedback for video genera- tion

Xuan He, Dongfu Jiang, Ge Zhang, Max Ku, Achint Soni, Sherman Siu, Haonan Chen, Abhranil Chandra, Ziyan Jiang, Aaran Arulraj, et al. Videoscore: Building automatic metrics to simulate fine-grained human feedback for video genera- tion. arXiv preprint arXiv:2406.15252, 2024. 2, 3

work page arXiv 2024
[15]

Classifier-Free Diffusion Guidance

Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598, 2022. 5, 7

work page internal anchor Pith review Pith/arXiv arXiv 2022
[16]

Denoising dif- fusion probabilistic models

Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising dif- fusion probabilistic models. In Advances in Neural Informa- tion Processing Systems, pages 6840–6851, 2020. 1, 4

work page 2020
[17]

Jonathan Ho, Tim Salimans, Alexey Gritsenko, William Chan, Mohammad Norouzi, and David J. Fleet. Video diffu- sion models. arXiv preprint arXiv:2204.03458, 2022. 2

work page internal anchor Pith review Pith/arXiv arXiv 2022
[18]

CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers

Yu Hong, Jing Wei, Xing Liu, Xiaodi Wang, Yutong Bai, Haitao Li, Ming Zhang, and Hao Xu. Cogvideo: Large- scale pretraining for text-to-video generation with transform- ers. arXiv preprint arXiv:2205.15868, 2022. 2

work page internal anchor Pith review Pith/arXiv arXiv 2022
[19]

Vbench: Comprehensive bench- mark suite for video generative models

Ziqi Huang, Yinan He, Jiashuo Yu, Fan Zhang, Chenyang Si, Yuming Jiang, Yuanhan Zhang, Tianxing Wu, Qingyang Jin, Nattapol Chanpaisit, et al. Vbench: Comprehensive bench- mark suite for video generative models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21807–21818, 2024. 7

work page 2024
[20]

Pyramidal flow matching for efficient video generative modeling.arXiv preprint arXiv:2410.05954,

Yang Jin, Zhicheng Sun, Ningyuan Li, Kun Xu, Hao Jiang, Nan Zhuang, Quzhe Huang, Yang Song, Yadong Mu, and Zhouchen Lin. Pyramidal flow matching for efficient video generative modeling. arXiv preprint arXiv:2410.05954 ,

work page arXiv
[21]

Text2video-zero: Zero-shot text-to-video genera- tion using pretrained text-to-image diffusion models

Levon Khachatryan, Adrien Davy, Baptiste Emond, and Jun Wang. Text2video-zero: Zero-shot text-to-video genera- tion using pretrained text-to-image diffusion models. arXiv preprint arXiv:2302.01327, 2023. 2

work page arXiv 2023
[22]

Consistency traject ory models: Learning probability ﬂow ode trajectory of diﬀusion

Dongjun Kim, Chieh-Hsin Lai, Wei-Hsiang Liao, Naoki Mu- rata, Yuhta Takida, Toshimitsu Uesaka, Yutong He, Yuki Mitsufuji, and Stefano Ermon. Consistency trajectory mod- els: Learning probability flow ode trajectory of diffusion. arXiv preprint arXiv:2310.02279, 2023. 2

work page arXiv 2023
[23]

Auto-Encoding Variational Bayes

Diederik P Kingma. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013. 7

work page internal anchor Pith review Pith/arXiv arXiv 2013
[24]

Pick-a-pic: An open dataset of user preferences for text-to-image generation

Yuval Kirstain, Adam Polyak, Uriel Singer, Shahbuland Ma- tiana, Joe Penna, and Omer Levy. Pick-a-pic: An open dataset of user preferences for text-to-image generation. Advances in Neural Information Processing Systems , 36: 36652–36663, 2023. 7, 2, 3

work page 2023
[25]

Kuaishou. Kling. https://kling.kuaishou.com/ en, 2024. Accessed: [today’s date]. 1, 8

work page 2024
[26]

T2v- turbo: Breaking the quality bottleneck of video consis- tency model with mixed reward feedback

Jiachen Li, Weixi Feng, Tsu-Jui Fu, Xinyi Wang, Sug- ato Basu, Wenhu Chen, and William Yang Wang. T2v- turbo: Breaking the quality bottleneck of video consis- tency model with mixed reward feedback. arXiv preprint arXiv:2405.18750, 2024. 1, 3, 8

work page arXiv 2024
[27]

Flow Matching for Generative Modeling

Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximil- ian Nickel, and Matt Le. Flow matching for generative mod- eling. arXiv preprint arXiv:2210.02747, 2022. 3, 4

work page internal anchor Pith review Pith/arXiv arXiv 2022
[28]

Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow

Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow. arXiv preprint arXiv:2209.03003, 2022. 3, 4

work page internal anchor Pith review Pith/arXiv arXiv 2022
[29]

Instaflow: One step is enough for high-quality diffusion- based text-to-image generation

Xingchao Liu, Xiwen Zhang, Jianzhu Ma, Jian Peng, et al. Instaflow: One step is enough for high-quality diffusion- based text-to-image generation. In The Twelfth International Conference on Learning Representations, 2023. 3, 4

work page 2023
[30]

Decoupled Weight Decay Regularization

I Loshchilov. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017. 7

work page internal anchor Pith review Pith/arXiv arXiv 2017
[31]

Simplifying, Stabilizing and Scaling Continuous-Time Consistency Models

Cheng Lu and Yang Song. Simplifying, stabilizing and scaling continuous-time consistency models. arXiv preprint arXiv:2410.11081, 2024. 2

work page internal anchor Pith review Pith/arXiv arXiv 2024
[32]

Dpm-solver: A fast ode solver for diﬀusion probabilistic model sampling in around 10 step s

Chao Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu. DPM-Solver: A fast ode solver for diffu- sion probabilistic model sampling in around 10 steps. arXiv preprint arXiv:2206.00927, 2022. 3

work page arXiv 2022
[33]

Knowledge distillation for generative models

Eric Luhman and Tobias Luhman. Knowledge distillation for generative models. arXiv preprint arXiv:2106.05237, 2021. 1, 3

work page arXiv 2021
[34]

Latent Consistency Models: Synthesizing High-Resolution Images with Few-Step Inference

Simian Luo, Yiqin Tan, Longbo Huang, Jian Li, and Hang Zhao. Latent consistency models: Synthesizing high- resolution images with few-step inference. arXiv preprint arXiv:2310.04378, 2023. 3, 4, 7

work page internal anchor Pith review Pith/arXiv arXiv 2023
[35]

Diff-instruct: A universal approach for transferring knowledge from pre-trained diffu- sion models

Weijian Luo, Tianyang Hu, Shifeng Zhang, Jiacheng Sun, Zhenguo Li, and Zhihua Zhang. Diff-instruct: A universal approach for transferring knowledge from pre-trained diffu- sion models. Advances in Neural Information Processing Systems, 36, 2024. 2, 3

work page 2024
[36]

Scalable diffusion models with transformers

William Peebles and Saining Xie. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF Inter- national Conference on Computer Vision, pages 4195–4205,

work page
[37]

Pika Labs

Pika Labs. Pika Labs. https://www.pika.art/. Ac- cessed: September 25, 2023. 8

work page 2023
[38]

Model compression via distillation and quantization

Antonio Polino, Razvan Pascanu, and Dan Alistarh. Model compression via distillation and quantization. arXiv preprint arXiv:1802.05668, 2018. 1

work page internal anchor Pith review Pith/arXiv arXiv 2018
[39]

Movie Gen: A Cast of Media Foundation Models

Adam Polyak, Amit Zohar, Andrew Brown, Andros Tjandra, Animesh Sinha, Ann Lee, Apoorv Vyas, Bowen Shi, Chih- Yao Ma, Ching-Yao Chuang, et al. Movie gen: A cast of media foundation models. arXiv preprint arXiv:2410.13720,

work page internal anchor Pith review Pith/arXiv arXiv
[40]

DreamFusion: Text-to-3D using 2D Diffusion

Ben Poole, Ajay Jain, Jonathan T Barron, and Ben Milden- hall. Dreamfusion: Text-to-3d using 2d diffusion. arXiv preprint arXiv:2209.14988, 2022. 2, 3, 5

work page internal anchor Pith review Pith/arXiv arXiv 2022
[41]

Video diffusion alignment via reward gradients.arXiv preprint arXiv:2407.08737, 2024

Mihir Prabhudesai, Russell Mendonca, Zheyang Qin, Kate- rina Fragkiadaki, and Deepak Pathak. Video diffusion align- ment via reward gradients.arXiv preprint arXiv:2407.08737,

work page arXiv
[42]

Diffusion Policy Policy Optimization

Allen Z Ren, Justin Lidard, Lars L Ankile, Anthony Simeonov, Pulkit Agrawal, Anirudha Majumdar, Ben- jamin Burchfiel, Hongkai Dai, and Max Simchowitz. Diffusion policy policy optimization. arXiv preprint arXiv:2409.00588, 2024. 10

work page internal anchor Pith review Pith/arXiv arXiv 2024
[43]

High-resolution image synthesis with latent diffusion models

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Bj ¨orn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022. 2, 4, 6, 7

work page 2022
[44]

Progressive Distillation for Fast Sampling of Diffusion Models

Tim Salimans and Jonathan Ho. Progressive distillation for fast sampling of diffusion models. arXiv preprint arXiv:2202.00512, 2022. 1, 3

work page internal anchor Pith review Pith/arXiv arXiv 2022
[45]

Multistep distilla- tion of diffusion models via moment matching

Tim Salimans, Thomas Mensink, Jonathan Heek, and Emiel Hoogeboom. Multistep distillation of diffusion models via moment matching. arXiv preprint arXiv:2406.04103, 2024. 2, 3

work page arXiv 2024
[46]

Laion-5b: An open large-scale dataset for training next generation image-text models

Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Worts- man, et al. Laion-5b: An open large-scale dataset for training next generation image-text models. Advances in Neural In- formation Processing Systems , 35:25278–25294, 2022. 2, 3

work page 2022
[47]

Make-A-Video: Text-to-Video Generation without Text-Video Data

Uriel Singer, Adam Polyak, Eliya Nachmani, Guy Dahan, Eli Shechtman, and Haggai Hacohen. Make-a-video: Text- to-video generation without text-video data. arXiv preprint arXiv:2209.14792, 2022. 2

work page internal anchor Pith review Pith/arXiv arXiv 2022
[48]

Weiss, Niru Mah- eswaranathan, and Surya Ganguli

Jascha Sohl-Dickstein, Eric A. Weiss, Niru Mah- eswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In Interna- tional Conference on Machine Learning , pages 2256–2265,

work page
[49]

Denois- ing diffusion implicit models

Jiaming Song, Chenlin Meng, and Stefano Ermon. Denois- ing diffusion implicit models. In International Conference on Learning Representations, 2021. 3, 1

work page 2021
[50]

Generative modeling by esti- mating gradients of the data distribution

Yang Song and Stefano Ermon. Generative modeling by esti- mating gradients of the data distribution. Advances in neural information processing systems, 32, 2019. 1

work page 2019
[51]

Kingma, Ab- hishek Kumar, Stefano Ermon, and Ben Poole

Yang Song, Jascha Sohl-Dickstein, Diederik P. Kingma, Ab- hishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equa- tions. In International Conference on Learning Represen- tations, 2021. 1, 4

work page 2021
[52]

Consistency Models

Yang Song, Prafulla Dhariwal, Mark Chen, and Ilya Sutskever. Consistency models. arXiv preprint arXiv:2303.01469, 2023. 2, 3, 4

work page internal anchor Pith review Pith/arXiv arXiv 2023
[53]

Phenaki: Variable Length Video Generation From Open Domain Textual Description

Ruben Villegas, Jiahui Yang, Sergey Tulyakov, Jan Kautz, and Seungjun Hong. Phenaki: Variable length video gener- ation from open domain textual descriptions. arXiv preprint arXiv:2210.02399, 2022. 2

work page internal anchor Pith review Pith/arXiv arXiv 2022
[54]

Animatelcm: Accelerating the animation of personalized diffusion mod- els and adapters with decoupled consistency learning

Fu-Yun Wang, Zhaoyang Huang, Xiaoyu Shi, Weikang Bian, Guanglu Song, Yu Liu, and Hongsheng Li. Animatelcm: Accelerating the animation of personalized diffusion mod- els and adapters with decoupled consistency learning. arXiv preprint arXiv:2402.00769, 2024. 3

work page arXiv 2024
[55]

Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation

Haochen Wang, Xiaodan Du, Jiahao Li, Raymond A Yeh, and Greg Shakhnarovich. Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In Pro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12619–12629, 2023. 2

work page 2023
[56]

ModelScope Text-to-Video Technical Report

Jiuniu Wang, Hangjie Yuan, Dayou Chen, Yingya Zhang, Xiang Wang, and Shiwei Zhang. Modelscope text-to-video technical report. arXiv preprint arXiv:2308.06571, 2023. 2

work page internal anchor Pith review Pith/arXiv arXiv 2023
[57]

Videolcm: Video latent consistency model.arXiv preprint arXiv:2312.09109,

Xiang Wang, Shiwei Zhang, Han Zhang, Yu Liu, Yingya Zhang, Changxin Gao, and Nong Sang. Videolcm: Video latent consistency model. arXiv preprint arXiv:2312.09109,

work page arXiv
[58]

Lavie: High-quality video gener- ation with cascaded latent diffusion models

Yaohui Wang, Xinyuan Chen, Xin Ma, Shangchen Zhou, Ziqi Huang, Yi Wang, Ceyuan Yang, Yinan He, Jiashuo Yu, Peiqing Yang, et al. Lavie: High-quality video gener- ation with cascaded latent diffusion models. arXiv preprint arXiv:2309.15103, 2023. 2

work page arXiv 2023
[59]

InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation

Yi Wang, Yinan He, Yizhuo Li, Kunchang Li, Jiashuo Yu, Xin Ma, Xinhao Li, Guo Chen, Xinyuan Chen, Yaohui Wang, et al. Internvid: A large-scale video-text dataset for multimodal understanding and generation. arXiv preprint arXiv:2307.06942, 2023. 2, 3

work page internal anchor Pith review Pith/arXiv arXiv 2023
[60]

Internvideo2: Scaling video foundation mod- els for multimodal video understanding

Yi Wang, Kunchang Li, Xinhao Li, Jiashuo Yu, Yinan He, Guo Chen, Baoqi Pei, Rongkun Zheng, Jilan Xu, Zun Wang, et al. Internvideo2: Scaling video foundation mod- els for multimodal video understanding. arXiv preprint arXiv:2403.15377, 2024. 2, 3

work page arXiv 2024
[61]

Prolificdreamer: High-fidelity and diverse text-to-3d generation with variational score distilla- tion

Zhengyi Wang, Cheng Lu, Yikai Wang, Fan Bao, Chongxuan Li, Hang Su, and Jun Zhu. Prolificdreamer: High-fidelity and diverse text-to-3d generation with variational score distilla- tion. Advances in Neural Information Processing Systems , 36, 2024. 2, 3, 5

work page 2024
[62]

Human Preference Score v2: A Solid Benchmark for Evaluating Human Preferences of Text-to-Image Synthesis

Xiaoshi Wu, Yiming Hao, Keqiang Sun, Yixiong Chen, Feng Zhu, Rui Zhao, and Hongsheng Li. Human preference score v2: A solid benchmark for evaluating human preferences of text-to-image synthesis. arXiv preprint arXiv:2306.09341 ,

work page internal anchor Pith review Pith/arXiv arXiv
[63]

Human preference score: Better aligning text- to-image models with human preference

Xiaoshi Wu, Keqiang Sun, Feng Zhu, Rui Zhao, and Hong- sheng Li. Human preference score: Better aligning text- to-image models with human preference. In Proceedings of the IEEE/CVF International Conference on Computer Vi- sion, pages 2096–2105, 2023. 2, 3 14

work page 2096
[64]

Tackling the generative learning trilemma with denoising diffusion gans.arXiv preprint arXiv:2112.07804,

Tianyu Xiao, Dara Bahri, Pawel Lucjan Stanczuk, Duygu Ceylan, Julian McAuley, Arash Vahdat, and Jan Kautz. Tack- ling the generative learning trilemma with denoising diffu- sion gans. arXiv preprint arXiv:2112.07804, 2021. 3

work page arXiv 2021
[65]

Dual diffusion models for high-fidelity video generation

Tong Xiao, Peng Liu, and Yi Yang. Dual diffusion models for high-fidelity video generation. arXiv preprint arXiv:2301.06513, 2023. 2

work page arXiv 2023
[66]

Kingma, Tingbo Hou, Ying Nian Wu, Kevin Patrick Murphy, Tim Salimans, Ben Poole, and Ruiqi Gao

Sirui Xie, Zhisheng Xiao, Diederik P Kingma, Tingbo Hou, Ying Nian Wu, Kevin Patrick Murphy, Tim Salimans, Ben Poole, and Ruiqi Gao. Em distillation for one-step diffusion models. arXiv preprint arXiv:2405.16852, 2024. 2, 3

work page arXiv 2024
[67]

Imagere- ward: Learning and evaluating human preferences for text- to-image generation

Jiazheng Xu, Xiao Liu, Yuchen Wu, Yuxuan Tong, Qinkai Li, Ming Ding, Jie Tang, and Yuxiao Dong. Imagere- ward: Learning and evaluating human preferences for text- to-image generation. Advances in Neural Information Pro- cessing Systems, 36, 2024. 3, 5, 10, 1, 2

work page 2024
[68]

CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer

Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiao- han Zhang, Guanyu Feng, et al. Cogvideox: Text-to-video diffusion models with an expert transformer. arXiv preprint arXiv:2408.06072, 2024. 2, 7, 11

work page internal anchor Pith review Pith/arXiv arXiv 2024
[69]

Improved distribution matching distillation for fast image synthesis.arXiv preprint arXiv:2405.14867,

Tianwei Yin, Micha ¨el Gharbi, Taesung Park, Richard Zhang, Eli Shechtman, Fredo Durand, and William T Freeman. Im- proved distribution matching distillation for fast image syn- thesis. arXiv preprint arXiv:2405.14867, 2024. 2, 3, 5, 7

work page arXiv 2024
[70]

One-step diffusion with distribution matching distillation

Tianwei Yin, Micha ¨el Gharbi, Richard Zhang, Eli Shecht- man, Fredo Durand, William T Freeman, and Taesung Park. One-step diffusion with distribution matching distillation. In Proceedings of the IEEE/CVF Conference on Computer Vi- sion and Pattern Recognition , pages 6613–6623, 2024. 2, 3

work page 2024
[71]

Sf-v: Single forward video generation model

Zhixing Zhang, Yanyu Li, Yushu Wu, Yanwu Xu, Anil Kag, Ivan Skorokhodov, Willi Menapace, Aliaksandr Siarohin, Junli Cao, Dimitris Metaxas, et al. Sf-v: Single forward video generation model. arXiv preprint arXiv:2406.04324,

work page arXiv
[72]

PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel

Yanli Zhao, Andrew Gu, Rohan Varma, Liang Luo, Chien- Chin Huang, Min Xu, Less Wright, Hamid Shojanazeri, Myle Ott, Sam Shleifer, et al. Pytorch fsdp: experi- ences on scaling fully sharded data parallel. arXiv preprint arXiv:2304.11277, 2023. 7

work page internal anchor Pith review Pith/arXiv arXiv 2023
[73]

Open-sora: Democratizing efficient video production for all, 2024

Zangwei Zheng, Xiangyu Peng, Tianji Yang, Chenhui Shen, Shenggui Li, Hongxin Liu, Yukun Zhou, Tianyi Li, and Yang You. Open-sora: Democratizing efficient video production for all, 2024. 7 15 DOLLAR: Few-Step Video Generation via Distillation and Latent Reward Optimization Supplementary Material Table of Contents 6 . Derivations 1 6.1 . Proof of Eq. (5) . ...

work page 2024
[74]

More Qualitative Results

Visualization 9 10.1. More Qualitative Results . . . . . . . . . . 9 10.2. Comparison of Reward Model Fine-tuning . 9 10.3. Inference Steps . . . . . . . . . . . . . . . 10 10.4. Diversity . . . . . . . . . . . . . . . . . . . 10 10.5. Prompt Length . . . . . . . . . . . . . . . 10 10.6. Sampling with Various Styles and Motions . 10

work page
[75]

Proof of Eq

Derivations 6.1. Proof of Eq. (5) We start from the forward diffusion process of DDPM [16]. The distribution of one-step diffusion processq(xt|xt−1) = N (xt; √αtxt−1, (1 − αt)I) can be equivalently written as: xt = √αtxt−1 + √ 1 − αtε, ε ∼ N (0, I) (18) with t ∈ [T ]. By chain rule, we have xt = √¯αtx0 + √ 1 − ¯αtε (19) with ¯αt = Πt i=1αi. Equivalently, ...

work page
[76]

Direct Reward Gradient In this section, we discuss in details why the direct reward gradient methods like ReFL [67] and DRaFT [8], cannot fit into the memory efficiently

Reward Model Fine-Tuning 7.1. Direct Reward Gradient In this section, we discuss in details why the direct reward gradient methods like ReFL [67] and DRaFT [8], cannot fit into the memory efficiently. 1 Take the HPSv2 [62] model as an example. It ap- plies fine-tuned version of ViT-H/14 variant of CLIP model, which contains 32 image transformer layers and...

work page 2000
[77]

Human Evaluation Human Evaluation Details

Additional Experimental Results 8.1. Human Evaluation Human Evaluation Details. Fig. 13 displays the user in- terface for human evaluation experiments. The four choices include visual quality, text-video alignment, motion and general preference, which correspond to the four reported metrics in Fig. 6. For the pairwise comparison of meth- ods, the videos a...

work page
[78]

The experiments in Sec

Challenges and Discussions Long Prompt Bias. The experiments in Sec. 4.5 show that, current models perform better for long and more de- scriptive prompt, which is inherited from the teacher model. The reason is hypothesized to be the well-captioned text-to- video training dataset, which emphasize detailed descrip- tions. With longer prompts, the text-vide...

work page
[79]

Cool Baby

Visualization 10.1. More Qualitative Results More qualitative results of our methods (VSD+CD+LRM) are displayed in Fig. 15, 16 and 17. Visual comparison of our methods with baselines in Tab. 2 for generated samples with the same prompt is shown in Fig. 18 and 19. For fair of comparison, we visualize all sampled frames with resolution 192 × 320 as the typi...

work page

[1] [1]

GPT-4 Technical Report

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ah- mad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774 ,

work page internal anchor Pith review Pith/arXiv arXiv

[2] [2]

Lumiere: A space-time diffusion model for video generation.arXiv preprint arXiv:2401.12945, 2024

Omer Bar-Tal, Hila Chefer, Omer Tov, Charles Her- rmann, Roni Paiss, Shiran Zada, Ariel Ephrat, Junhwa Hur, Guanghui Liu, Amit Raj, et al. Lumiere: A space- time diffusion model for video generation. arXiv preprint arXiv:2401.12945, 2024. 2

work page arXiv 2024

[3] [3]

Training Diffusion Models with Reinforcement Learning

Kevin Black, Michael Janner, Yilun Du, Ilya Kostrikov, and Sergey Levine. Training diffusion models with reinforce- ment learning. arXiv preprint arXiv:2305.13301, 2023. 3, 5, 10, 12, 2

work page internal anchor Pith review Pith/arXiv arXiv 2023

[4] [4]

Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets

Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram V oleti, Adam Letts, et al. Stable video diffusion: Scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127, 2023. 2

work page internal anchor Pith review Pith/arXiv arXiv 2023

[5] [5]

Align your latents: High-resolution video synthesis with la- tent diffusion models

Andreas Blattmann, Robin Rombach, Huan Ling, Tim Dock- horn, Seung Wook Kim, Sanja Fidler, and Karsten Kreis. Align your latents: High-resolution video synthesis with la- tent diffusion models. In Proceedings of the IEEE/CVF Con- ference on Computer Vision and Pattern Recognition, pages 22563–22575, 2023. 2, 7

work page 2023

[6] [6]

Videocrafter2: 12 Overcoming data limitations for high-quality video diffu- sion models

Haoxin Chen, Yong Zhang, Xiaodong Cun, Menghan Xia, Xintao Wang, Chao Weng, and Ying Shan. Videocrafter2: 12 Overcoming data limitations for high-quality video diffu- sion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages 7310– 7320, 2024. 2

work page 2024

[7] [7]

Diffusion Posterior Sampling for General Noisy Inverse Problems

Hyungjin Chung, Jeongsol Kim, Michael T Mccann, Marc L Klasky, and Jong Chul Ye. Diffusion posterior sam- pling for general noisy inverse problems. arXiv preprint arXiv:2209.14687, 2022. 3

work page internal anchor Pith review Pith/arXiv arXiv 2022

[8] [8]

Directly Fine-Tuning Diffusion Models on Differentiable Rewards

Kevin Clark, Paul Vicol, Kevin Swersky, and David J Fleet. Directly fine-tuning diffusion models on differentiable re- wards. arXiv preprint arXiv:2309.17400, 2023. 3, 5, 6, 10, 1

work page internal anchor Pith review Pith/arXiv arXiv 2023

[9] [9]

Adjoint matching: Fine- tuning flow and diffusion generative models with memoryless stochastic optimal control.arXiv preprint arXiv:2409.08861, 2024

Carles Domingo-Enrich, Michal Drozdzal, Brian Karrer, and Ricky TQ Chen. Adjoint matching: Fine-tuning flow and diffusion generative models with memoryless stochastic op- timal control. arXiv preprint arXiv:2409.08861, 2024. 3, 4, 9

work page arXiv 2024

[10] [10]

Structure- aware video generation with latent diffusion models

Patrick Esser, Robin Rombach, and Bj¨orn Ommer. Structure- aware video generation with latent diffusion models. arXiv preprint arXiv:2303.07332, 2023. 1, 2, 8

work page arXiv 2023

[11] [11]

The vendi score: A diversity evaluation metric for machine learning

Dan Friedman and Adji Bousso Dieng. The vendi score: A diversity evaluation metric for machine learning. arXiv preprint arXiv:2210.02410, 2022. 9

work page arXiv 2022

[12] [12]

Generative adversarial nets

Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. Advances in neural information processing systems, 27, 2014. 2

work page 2014

[13] [13]

Flexible diffusion modeling of long videos.arXiv preprint arXiv:2205.11495, 2022

William Harvey, Søren Nørskov, Niklas K ¨olch, and George V ogiatzis. Flexible diffusion modeling of long videos.arXiv preprint arXiv:2205.11495, 2022. 2

work page arXiv 2022

[14] [14]

Videoscore: Building automatic metrics to simulate fine-grained human feedback for video genera- tion

Xuan He, Dongfu Jiang, Ge Zhang, Max Ku, Achint Soni, Sherman Siu, Haonan Chen, Abhranil Chandra, Ziyan Jiang, Aaran Arulraj, et al. Videoscore: Building automatic metrics to simulate fine-grained human feedback for video genera- tion. arXiv preprint arXiv:2406.15252, 2024. 2, 3

work page arXiv 2024

[15] [15]

Classifier-Free Diffusion Guidance

Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598, 2022. 5, 7

work page internal anchor Pith review Pith/arXiv arXiv 2022

[16] [16]

Denoising dif- fusion probabilistic models

Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising dif- fusion probabilistic models. In Advances in Neural Informa- tion Processing Systems, pages 6840–6851, 2020. 1, 4

work page 2020

[17] [17]

Jonathan Ho, Tim Salimans, Alexey Gritsenko, William Chan, Mohammad Norouzi, and David J. Fleet. Video diffu- sion models. arXiv preprint arXiv:2204.03458, 2022. 2

work page internal anchor Pith review Pith/arXiv arXiv 2022

[18] [18]

CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers

Yu Hong, Jing Wei, Xing Liu, Xiaodi Wang, Yutong Bai, Haitao Li, Ming Zhang, and Hao Xu. Cogvideo: Large- scale pretraining for text-to-video generation with transform- ers. arXiv preprint arXiv:2205.15868, 2022. 2

work page internal anchor Pith review Pith/arXiv arXiv 2022

[19] [19]

Vbench: Comprehensive bench- mark suite for video generative models

Ziqi Huang, Yinan He, Jiashuo Yu, Fan Zhang, Chenyang Si, Yuming Jiang, Yuanhan Zhang, Tianxing Wu, Qingyang Jin, Nattapol Chanpaisit, et al. Vbench: Comprehensive bench- mark suite for video generative models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21807–21818, 2024. 7

work page 2024

[20] [20]

Pyramidal flow matching for efficient video generative modeling.arXiv preprint arXiv:2410.05954,

Yang Jin, Zhicheng Sun, Ningyuan Li, Kun Xu, Hao Jiang, Nan Zhuang, Quzhe Huang, Yang Song, Yadong Mu, and Zhouchen Lin. Pyramidal flow matching for efficient video generative modeling. arXiv preprint arXiv:2410.05954 ,

work page arXiv

[21] [21]

Text2video-zero: Zero-shot text-to-video genera- tion using pretrained text-to-image diffusion models

Levon Khachatryan, Adrien Davy, Baptiste Emond, and Jun Wang. Text2video-zero: Zero-shot text-to-video genera- tion using pretrained text-to-image diffusion models. arXiv preprint arXiv:2302.01327, 2023. 2

work page arXiv 2023

[22] [22]

Consistency traject ory models: Learning probability ﬂow ode trajectory of diﬀusion

Dongjun Kim, Chieh-Hsin Lai, Wei-Hsiang Liao, Naoki Mu- rata, Yuhta Takida, Toshimitsu Uesaka, Yutong He, Yuki Mitsufuji, and Stefano Ermon. Consistency trajectory mod- els: Learning probability flow ode trajectory of diffusion. arXiv preprint arXiv:2310.02279, 2023. 2

work page arXiv 2023

[23] [23]

Auto-Encoding Variational Bayes

Diederik P Kingma. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013. 7

work page internal anchor Pith review Pith/arXiv arXiv 2013

[24] [24]

Pick-a-pic: An open dataset of user preferences for text-to-image generation

Yuval Kirstain, Adam Polyak, Uriel Singer, Shahbuland Ma- tiana, Joe Penna, and Omer Levy. Pick-a-pic: An open dataset of user preferences for text-to-image generation. Advances in Neural Information Processing Systems , 36: 36652–36663, 2023. 7, 2, 3

work page 2023

[25] [25]

Kuaishou. Kling. https://kling.kuaishou.com/ en, 2024. Accessed: [today’s date]. 1, 8

work page 2024

[26] [26]

T2v- turbo: Breaking the quality bottleneck of video consis- tency model with mixed reward feedback

Jiachen Li, Weixi Feng, Tsu-Jui Fu, Xinyi Wang, Sug- ato Basu, Wenhu Chen, and William Yang Wang. T2v- turbo: Breaking the quality bottleneck of video consis- tency model with mixed reward feedback. arXiv preprint arXiv:2405.18750, 2024. 1, 3, 8

work page arXiv 2024

[27] [27]

Flow Matching for Generative Modeling

Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximil- ian Nickel, and Matt Le. Flow matching for generative mod- eling. arXiv preprint arXiv:2210.02747, 2022. 3, 4

work page internal anchor Pith review Pith/arXiv arXiv 2022

[28] [28]

Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow

Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow. arXiv preprint arXiv:2209.03003, 2022. 3, 4

work page internal anchor Pith review Pith/arXiv arXiv 2022

[29] [29]

Instaflow: One step is enough for high-quality diffusion- based text-to-image generation

Xingchao Liu, Xiwen Zhang, Jianzhu Ma, Jian Peng, et al. Instaflow: One step is enough for high-quality diffusion- based text-to-image generation. In The Twelfth International Conference on Learning Representations, 2023. 3, 4

work page 2023

[30] [30]

Decoupled Weight Decay Regularization

I Loshchilov. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017. 7

work page internal anchor Pith review Pith/arXiv arXiv 2017

[31] [31]

Simplifying, Stabilizing and Scaling Continuous-Time Consistency Models

Cheng Lu and Yang Song. Simplifying, stabilizing and scaling continuous-time consistency models. arXiv preprint arXiv:2410.11081, 2024. 2

work page internal anchor Pith review Pith/arXiv arXiv 2024

[32] [32]

Dpm-solver: A fast ode solver for diﬀusion probabilistic model sampling in around 10 step s

Chao Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu. DPM-Solver: A fast ode solver for diffu- sion probabilistic model sampling in around 10 steps. arXiv preprint arXiv:2206.00927, 2022. 3

work page arXiv 2022

[33] [33]

Knowledge distillation for generative models

Eric Luhman and Tobias Luhman. Knowledge distillation for generative models. arXiv preprint arXiv:2106.05237, 2021. 1, 3

work page arXiv 2021

[34] [34]

Latent Consistency Models: Synthesizing High-Resolution Images with Few-Step Inference

Simian Luo, Yiqin Tan, Longbo Huang, Jian Li, and Hang Zhao. Latent consistency models: Synthesizing high- resolution images with few-step inference. arXiv preprint arXiv:2310.04378, 2023. 3, 4, 7

work page internal anchor Pith review Pith/arXiv arXiv 2023

[35] [35]

Diff-instruct: A universal approach for transferring knowledge from pre-trained diffu- sion models

Weijian Luo, Tianyang Hu, Shifeng Zhang, Jiacheng Sun, Zhenguo Li, and Zhihua Zhang. Diff-instruct: A universal approach for transferring knowledge from pre-trained diffu- sion models. Advances in Neural Information Processing Systems, 36, 2024. 2, 3

work page 2024

[36] [36]

Scalable diffusion models with transformers

William Peebles and Saining Xie. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF Inter- national Conference on Computer Vision, pages 4195–4205,

work page

[37] [37]

Pika Labs

Pika Labs. Pika Labs. https://www.pika.art/. Ac- cessed: September 25, 2023. 8

work page 2023

[38] [38]

Model compression via distillation and quantization

Antonio Polino, Razvan Pascanu, and Dan Alistarh. Model compression via distillation and quantization. arXiv preprint arXiv:1802.05668, 2018. 1

work page internal anchor Pith review Pith/arXiv arXiv 2018

[39] [39]

Movie Gen: A Cast of Media Foundation Models

Adam Polyak, Amit Zohar, Andrew Brown, Andros Tjandra, Animesh Sinha, Ann Lee, Apoorv Vyas, Bowen Shi, Chih- Yao Ma, Ching-Yao Chuang, et al. Movie gen: A cast of media foundation models. arXiv preprint arXiv:2410.13720,

work page internal anchor Pith review Pith/arXiv arXiv

[40] [40]

DreamFusion: Text-to-3D using 2D Diffusion

Ben Poole, Ajay Jain, Jonathan T Barron, and Ben Milden- hall. Dreamfusion: Text-to-3d using 2d diffusion. arXiv preprint arXiv:2209.14988, 2022. 2, 3, 5

work page internal anchor Pith review Pith/arXiv arXiv 2022

[41] [41]

Video diffusion alignment via reward gradients.arXiv preprint arXiv:2407.08737, 2024

Mihir Prabhudesai, Russell Mendonca, Zheyang Qin, Kate- rina Fragkiadaki, and Deepak Pathak. Video diffusion align- ment via reward gradients.arXiv preprint arXiv:2407.08737,

work page arXiv

[42] [42]

Diffusion Policy Policy Optimization

Allen Z Ren, Justin Lidard, Lars L Ankile, Anthony Simeonov, Pulkit Agrawal, Anirudha Majumdar, Ben- jamin Burchfiel, Hongkai Dai, and Max Simchowitz. Diffusion policy policy optimization. arXiv preprint arXiv:2409.00588, 2024. 10

work page internal anchor Pith review Pith/arXiv arXiv 2024

[43] [43]

High-resolution image synthesis with latent diffusion models

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Bj ¨orn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022. 2, 4, 6, 7

work page 2022

[44] [44]

Progressive Distillation for Fast Sampling of Diffusion Models

Tim Salimans and Jonathan Ho. Progressive distillation for fast sampling of diffusion models. arXiv preprint arXiv:2202.00512, 2022. 1, 3

work page internal anchor Pith review Pith/arXiv arXiv 2022

[45] [45]

Multistep distilla- tion of diffusion models via moment matching

Tim Salimans, Thomas Mensink, Jonathan Heek, and Emiel Hoogeboom. Multistep distillation of diffusion models via moment matching. arXiv preprint arXiv:2406.04103, 2024. 2, 3

work page arXiv 2024

[46] [46]

Laion-5b: An open large-scale dataset for training next generation image-text models

Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Worts- man, et al. Laion-5b: An open large-scale dataset for training next generation image-text models. Advances in Neural In- formation Processing Systems , 35:25278–25294, 2022. 2, 3

work page 2022

[47] [47]

Make-A-Video: Text-to-Video Generation without Text-Video Data

Uriel Singer, Adam Polyak, Eliya Nachmani, Guy Dahan, Eli Shechtman, and Haggai Hacohen. Make-a-video: Text- to-video generation without text-video data. arXiv preprint arXiv:2209.14792, 2022. 2

work page internal anchor Pith review Pith/arXiv arXiv 2022

[48] [48]

Weiss, Niru Mah- eswaranathan, and Surya Ganguli

Jascha Sohl-Dickstein, Eric A. Weiss, Niru Mah- eswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In Interna- tional Conference on Machine Learning , pages 2256–2265,

work page

[49] [49]

Denois- ing diffusion implicit models

Jiaming Song, Chenlin Meng, and Stefano Ermon. Denois- ing diffusion implicit models. In International Conference on Learning Representations, 2021. 3, 1

work page 2021

[50] [50]

Generative modeling by esti- mating gradients of the data distribution

Yang Song and Stefano Ermon. Generative modeling by esti- mating gradients of the data distribution. Advances in neural information processing systems, 32, 2019. 1

work page 2019

[51] [51]

Kingma, Ab- hishek Kumar, Stefano Ermon, and Ben Poole

Yang Song, Jascha Sohl-Dickstein, Diederik P. Kingma, Ab- hishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equa- tions. In International Conference on Learning Represen- tations, 2021. 1, 4

work page 2021

[52] [52]

Consistency Models

Yang Song, Prafulla Dhariwal, Mark Chen, and Ilya Sutskever. Consistency models. arXiv preprint arXiv:2303.01469, 2023. 2, 3, 4

work page internal anchor Pith review Pith/arXiv arXiv 2023

[53] [53]

Phenaki: Variable Length Video Generation From Open Domain Textual Description

Ruben Villegas, Jiahui Yang, Sergey Tulyakov, Jan Kautz, and Seungjun Hong. Phenaki: Variable length video gener- ation from open domain textual descriptions. arXiv preprint arXiv:2210.02399, 2022. 2

work page internal anchor Pith review Pith/arXiv arXiv 2022

[54] [54]

Animatelcm: Accelerating the animation of personalized diffusion mod- els and adapters with decoupled consistency learning

Fu-Yun Wang, Zhaoyang Huang, Xiaoyu Shi, Weikang Bian, Guanglu Song, Yu Liu, and Hongsheng Li. Animatelcm: Accelerating the animation of personalized diffusion mod- els and adapters with decoupled consistency learning. arXiv preprint arXiv:2402.00769, 2024. 3

work page arXiv 2024

[55] [55]

Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation

Haochen Wang, Xiaodan Du, Jiahao Li, Raymond A Yeh, and Greg Shakhnarovich. Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In Pro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12619–12629, 2023. 2

work page 2023

[56] [56]

ModelScope Text-to-Video Technical Report

Jiuniu Wang, Hangjie Yuan, Dayou Chen, Yingya Zhang, Xiang Wang, and Shiwei Zhang. Modelscope text-to-video technical report. arXiv preprint arXiv:2308.06571, 2023. 2

work page internal anchor Pith review Pith/arXiv arXiv 2023

[57] [57]

Videolcm: Video latent consistency model.arXiv preprint arXiv:2312.09109,

Xiang Wang, Shiwei Zhang, Han Zhang, Yu Liu, Yingya Zhang, Changxin Gao, and Nong Sang. Videolcm: Video latent consistency model. arXiv preprint arXiv:2312.09109,

work page arXiv

[58] [58]

Lavie: High-quality video gener- ation with cascaded latent diffusion models

Yaohui Wang, Xinyuan Chen, Xin Ma, Shangchen Zhou, Ziqi Huang, Yi Wang, Ceyuan Yang, Yinan He, Jiashuo Yu, Peiqing Yang, et al. Lavie: High-quality video gener- ation with cascaded latent diffusion models. arXiv preprint arXiv:2309.15103, 2023. 2

work page arXiv 2023

[59] [59]

InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation

Yi Wang, Yinan He, Yizhuo Li, Kunchang Li, Jiashuo Yu, Xin Ma, Xinhao Li, Guo Chen, Xinyuan Chen, Yaohui Wang, et al. Internvid: A large-scale video-text dataset for multimodal understanding and generation. arXiv preprint arXiv:2307.06942, 2023. 2, 3

work page internal anchor Pith review Pith/arXiv arXiv 2023

[60] [60]

Internvideo2: Scaling video foundation mod- els for multimodal video understanding

Yi Wang, Kunchang Li, Xinhao Li, Jiashuo Yu, Yinan He, Guo Chen, Baoqi Pei, Rongkun Zheng, Jilan Xu, Zun Wang, et al. Internvideo2: Scaling video foundation mod- els for multimodal video understanding. arXiv preprint arXiv:2403.15377, 2024. 2, 3

work page arXiv 2024

[61] [61]

Prolificdreamer: High-fidelity and diverse text-to-3d generation with variational score distilla- tion

Zhengyi Wang, Cheng Lu, Yikai Wang, Fan Bao, Chongxuan Li, Hang Su, and Jun Zhu. Prolificdreamer: High-fidelity and diverse text-to-3d generation with variational score distilla- tion. Advances in Neural Information Processing Systems , 36, 2024. 2, 3, 5

work page 2024

[62] [62]

Human Preference Score v2: A Solid Benchmark for Evaluating Human Preferences of Text-to-Image Synthesis

Xiaoshi Wu, Yiming Hao, Keqiang Sun, Yixiong Chen, Feng Zhu, Rui Zhao, and Hongsheng Li. Human preference score v2: A solid benchmark for evaluating human preferences of text-to-image synthesis. arXiv preprint arXiv:2306.09341 ,

work page internal anchor Pith review Pith/arXiv arXiv

[63] [63]

Human preference score: Better aligning text- to-image models with human preference

Xiaoshi Wu, Keqiang Sun, Feng Zhu, Rui Zhao, and Hong- sheng Li. Human preference score: Better aligning text- to-image models with human preference. In Proceedings of the IEEE/CVF International Conference on Computer Vi- sion, pages 2096–2105, 2023. 2, 3 14

work page 2096

[64] [64]

Tackling the generative learning trilemma with denoising diffusion gans.arXiv preprint arXiv:2112.07804,

Tianyu Xiao, Dara Bahri, Pawel Lucjan Stanczuk, Duygu Ceylan, Julian McAuley, Arash Vahdat, and Jan Kautz. Tack- ling the generative learning trilemma with denoising diffu- sion gans. arXiv preprint arXiv:2112.07804, 2021. 3

work page arXiv 2021

[65] [65]

Dual diffusion models for high-fidelity video generation

Tong Xiao, Peng Liu, and Yi Yang. Dual diffusion models for high-fidelity video generation. arXiv preprint arXiv:2301.06513, 2023. 2

work page arXiv 2023

[66] [66]

Kingma, Tingbo Hou, Ying Nian Wu, Kevin Patrick Murphy, Tim Salimans, Ben Poole, and Ruiqi Gao

Sirui Xie, Zhisheng Xiao, Diederik P Kingma, Tingbo Hou, Ying Nian Wu, Kevin Patrick Murphy, Tim Salimans, Ben Poole, and Ruiqi Gao. Em distillation for one-step diffusion models. arXiv preprint arXiv:2405.16852, 2024. 2, 3

work page arXiv 2024

[67] [67]

Imagere- ward: Learning and evaluating human preferences for text- to-image generation

Jiazheng Xu, Xiao Liu, Yuchen Wu, Yuxuan Tong, Qinkai Li, Ming Ding, Jie Tang, and Yuxiao Dong. Imagere- ward: Learning and evaluating human preferences for text- to-image generation. Advances in Neural Information Pro- cessing Systems, 36, 2024. 3, 5, 10, 1, 2

work page 2024

[68] [68]

CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer

Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiao- han Zhang, Guanyu Feng, et al. Cogvideox: Text-to-video diffusion models with an expert transformer. arXiv preprint arXiv:2408.06072, 2024. 2, 7, 11

work page internal anchor Pith review Pith/arXiv arXiv 2024

[69] [69]

Improved distribution matching distillation for fast image synthesis.arXiv preprint arXiv:2405.14867,

Tianwei Yin, Micha ¨el Gharbi, Taesung Park, Richard Zhang, Eli Shechtman, Fredo Durand, and William T Freeman. Im- proved distribution matching distillation for fast image syn- thesis. arXiv preprint arXiv:2405.14867, 2024. 2, 3, 5, 7

work page arXiv 2024

[70] [70]

One-step diffusion with distribution matching distillation

Tianwei Yin, Micha ¨el Gharbi, Richard Zhang, Eli Shecht- man, Fredo Durand, William T Freeman, and Taesung Park. One-step diffusion with distribution matching distillation. In Proceedings of the IEEE/CVF Conference on Computer Vi- sion and Pattern Recognition , pages 6613–6623, 2024. 2, 3

work page 2024

[71] [71]

Sf-v: Single forward video generation model

Zhixing Zhang, Yanyu Li, Yushu Wu, Yanwu Xu, Anil Kag, Ivan Skorokhodov, Willi Menapace, Aliaksandr Siarohin, Junli Cao, Dimitris Metaxas, et al. Sf-v: Single forward video generation model. arXiv preprint arXiv:2406.04324,

work page arXiv

[72] [72]

PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel

Yanli Zhao, Andrew Gu, Rohan Varma, Liang Luo, Chien- Chin Huang, Min Xu, Less Wright, Hamid Shojanazeri, Myle Ott, Sam Shleifer, et al. Pytorch fsdp: experi- ences on scaling fully sharded data parallel. arXiv preprint arXiv:2304.11277, 2023. 7

work page internal anchor Pith review Pith/arXiv arXiv 2023

[73] [73]

Open-sora: Democratizing efficient video production for all, 2024

Zangwei Zheng, Xiangyu Peng, Tianji Yang, Chenhui Shen, Shenggui Li, Hongxin Liu, Yukun Zhou, Tianyi Li, and Yang You. Open-sora: Democratizing efficient video production for all, 2024. 7 15 DOLLAR: Few-Step Video Generation via Distillation and Latent Reward Optimization Supplementary Material Table of Contents 6 . Derivations 1 6.1 . Proof of Eq. (5) . ...

work page 2024

[74] [74]

More Qualitative Results

Visualization 9 10.1. More Qualitative Results . . . . . . . . . . 9 10.2. Comparison of Reward Model Fine-tuning . 9 10.3. Inference Steps . . . . . . . . . . . . . . . 10 10.4. Diversity . . . . . . . . . . . . . . . . . . . 10 10.5. Prompt Length . . . . . . . . . . . . . . . 10 10.6. Sampling with Various Styles and Motions . 10

work page

[75] [75]

Proof of Eq

Derivations 6.1. Proof of Eq. (5) We start from the forward diffusion process of DDPM [16]. The distribution of one-step diffusion processq(xt|xt−1) = N (xt; √αtxt−1, (1 − αt)I) can be equivalently written as: xt = √αtxt−1 + √ 1 − αtε, ε ∼ N (0, I) (18) with t ∈ [T ]. By chain rule, we have xt = √¯αtx0 + √ 1 − ¯αtε (19) with ¯αt = Πt i=1αi. Equivalently, ...

work page

[76] [76]

Direct Reward Gradient In this section, we discuss in details why the direct reward gradient methods like ReFL [67] and DRaFT [8], cannot fit into the memory efficiently

Reward Model Fine-Tuning 7.1. Direct Reward Gradient In this section, we discuss in details why the direct reward gradient methods like ReFL [67] and DRaFT [8], cannot fit into the memory efficiently. 1 Take the HPSv2 [62] model as an example. It ap- plies fine-tuned version of ViT-H/14 variant of CLIP model, which contains 32 image transformer layers and...

work page 2000

[77] [77]

Human Evaluation Human Evaluation Details

Additional Experimental Results 8.1. Human Evaluation Human Evaluation Details. Fig. 13 displays the user in- terface for human evaluation experiments. The four choices include visual quality, text-video alignment, motion and general preference, which correspond to the four reported metrics in Fig. 6. For the pairwise comparison of meth- ods, the videos a...

work page

[78] [78]

The experiments in Sec

Challenges and Discussions Long Prompt Bias. The experiments in Sec. 4.5 show that, current models perform better for long and more de- scriptive prompt, which is inherited from the teacher model. The reason is hypothesized to be the well-captioned text-to- video training dataset, which emphasize detailed descrip- tions. With longer prompts, the text-vide...

work page

[79] [79]

Cool Baby

Visualization 10.1. More Qualitative Results More qualitative results of our methods (VSD+CD+LRM) are displayed in Fig. 15, 16 and 17. Visual comparison of our methods with baselines in Tab. 2 for generated samples with the same prompt is shown in Fig. 18 and 19. For fair of comparison, we visualize all sampled frames with resolution 192 × 320 as the typi...

work page