Recognition: no theorem link
Directly Fine-Tuning Diffusion Models on Differentiable Rewards (ICLR 2024)
Pith reviewed 2026-05-16 09:07 UTC · model grok-4.3
The pith
Diffusion models can be fine-tuned directly on differentiable rewards by backpropagating gradients through the full sampling process.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Direct Reward Fine-Tuning (DRaFT) enables fine-tuning diffusion models to maximize differentiable reward functions by backpropagating the reward gradient through the full sampling procedure. DRaFT-K truncates backpropagation to the last K steps, and DRaFT-LV supplies lower-variance gradient estimates when K equals 1. The methods achieve strong performance across reward functions and substantially improve the aesthetic quality of images produced by Stable Diffusion 1.4, while also unifying prior gradient-based fine-tuning designs.
What carries the argument
Backpropagating the reward-function gradient through the iterative denoising sampling procedure of the diffusion model.
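The mechanism admits a minimal sketch. Everything below is illustrative, not the paper's model: a linear one-step "denoiser" and a quadratic reward stand in for a U-Net and a preference model, so the gradient that reverse-mode differentiation produces through the chain can be checked against a closed form.

```python
# Toy illustration of DRaFT's core mechanism: reverse-mode differentiation
# through an iterative sampling chain. The "denoiser" is deliberately simple,
# x <- (1 - eta) * x + eta * theta, so the backpropagated gradient can be
# verified against an analytic expression.

def sample_chain(theta, x_T, eta=0.1, T=50):
    """Run the full sampling chain, returning the final sample."""
    x = x_T
    for _ in range(T):
        x = (1 - eta) * x + eta * theta  # one deterministic "denoising" step
    return x

def reward(x0, target=1.0):
    return -(x0 - target) ** 2  # differentiable reward on the final sample

def reward_grad_wrt_theta(theta, x_T, target=1.0, eta=0.1, T=50):
    """Backpropagate d(reward)/d(theta) through every sampling step."""
    x0 = sample_chain(theta, x_T, eta, T)
    g_x = -2.0 * (x0 - target)      # d(reward)/d(x0)
    g_theta = 0.0
    for _ in range(T):              # reverse pass over the chain
        g_theta += g_x * eta        # each step contributes via d(step)/d(theta)
        g_x *= (1 - eta)            # gradient flowing to earlier steps shrinks
    return g_theta

# Closed form for this linear chain: x0 = (1-eta)^T x_T + (1-(1-eta)^T) theta,
# so d(reward)/d(theta) = -2 (x0 - target) (1 - (1-eta)^T).
theta, x_T, eta, T = 0.0, 2.0, 0.1, 50
x0 = sample_chain(theta, x_T, eta, T)
analytic = -2.0 * (x0 - 1.0) * (1 - (1 - eta) ** T)
assert abs(reward_grad_wrt_theta(theta, x_T) - analytic) < 1e-9
```

The reverse pass is exactly the per-step chain rule; in the real method the two scalar Jacobians become U-Net vector-Jacobian products at each of the T denoising steps.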
If this is right
- Strong performance is obtained on multiple reward functions including human preference models.
- The approach outperforms reinforcement-learning-based fine-tuning methods.
- Aesthetic quality of images generated by Stable Diffusion 1.4 improves substantially.
- Truncated and low-variance variants make full backpropagation computationally feasible.
Where Pith is reading between the lines
- The same gradient-through-sampling idea could apply to other iterative generative models whose sampling chains admit differentiation.
- Multiple differentiable rewards could be combined directly for multi-objective tuning without separate weighting schemes.
- Truncation to final steps may transfer to other long-horizon sampling or optimization problems beyond diffusion.
Load-bearing premise
The reward function must be differentiable with respect to the generated samples so that gradients can flow back through the sampling steps.
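A minimal way to probe this premise, assuming nothing about the paper's actual reward models, is a finite-difference check on a candidate reward (the quadratic reward here is an illustrative stand-in):

```python
# Sanity check (illustrative) that a candidate reward is differentiable with
# respect to the generated sample: compare an analytic gradient against a
# central finite difference.

def reward(x, target=1.0):
    return -(x - target) ** 2

def reward_grad(x, target=1.0):
    return -2.0 * (x - target)

def finite_diff(f, x, h=1e-6):
    return (f(x + h) - f(x - h)) / (2 * h)

x = 0.3
assert abs(reward_grad(x) - finite_diff(reward, x)) < 1e-4
# A non-differentiable reward (e.g. a hard accept/reject threshold) would fail
# such a check, and no gradient could flow back through the sampling steps.
```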
What would settle it
Training that diverges, or shows no reward improvement over the base model on a held-out evaluation set, despite gradients being backpropagated through sampling, would undermine the core claim.
original abstract
We present Direct Reward Fine-Tuning (DRaFT), a simple and effective method for fine-tuning diffusion models to maximize differentiable reward functions, such as scores from human preference models. We first show that it is possible to backpropagate the reward function gradient through the full sampling procedure, and that doing so achieves strong performance on a variety of rewards, outperforming reinforcement learning-based approaches. We then propose more efficient variants of DRaFT: DRaFT-K, which truncates backpropagation to only the last K steps of sampling, and DRaFT-LV, which obtains lower-variance gradient estimates for the case when K=1. We show that our methods work well for a variety of reward functions and can be used to substantially improve the aesthetic quality of images generated by Stable Diffusion 1.4. Finally, we draw connections between our approach and prior work, providing a unifying perspective on the design space of gradient-based fine-tuning algorithms.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Direct Reward Fine-Tuning (DRaFT), a method to fine-tune diffusion models by backpropagating gradients from differentiable reward functions through the full denoising sampling trajectory. It reports that this achieves strong performance outperforming RL-based baselines on multiple rewards, proposes truncated variants DRaFT-K (last K steps) and DRaFT-LV (low-variance estimator for K=1) for efficiency, demonstrates aesthetic improvements on Stable Diffusion 1.4, and unifies the approach with prior gradient-based fine-tuning methods.
Significance. If the central claims hold, the work offers a direct, non-RL alternative for reward optimization in diffusion models that could simplify pipelines while delivering measurable gains in image quality. The unification perspective on the design space of gradient-based methods adds conceptual value, and the empirical results on Stable Diffusion indicate practical utility for preference alignment tasks.
major comments (3)
- [§3] §3 (DRaFT description): the assertion that backpropagation through the full sampling chain (typically 50–1000 steps) is both feasible and yields strong performance is load-bearing for the headline claim, yet no per-step gradient norm statistics, vanishing/exploding analysis, or ablation on total denoising steps for the non-truncated case are reported. The introduction of DRaFT-K/DRaFT-LV precisely to address stability and memory issues leaves open whether the reported gains derive from the full procedure or the approximations.
- [§4] §4 (Experiments): outperformance over RL baselines is claimed across rewards, but the results lack error bars, multiple random seeds, or statistical significance tests. This undermines confidence that the gains are robust rather than attributable to sampling variance, especially given the differentiability and gradient-flow assumptions.
- [§3.1] §3.1 (Method): the memory footprint and wall-clock cost of full backpropagation versus the truncated variants are not quantified, which is critical for assessing the practicality of the non-truncated DRaFT that the abstract presents as the primary result.
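For intuition on the truncation question raised above, a toy linear chain (illustrative only: the step rule and `grad_through_last_k` are assumptions, not the paper's model) shows what restricting the reverse pass to the last K steps does to the gradient:

```python
# Toy contrast between full backpropagation and DRaFT-K style truncation on a
# linear sampling chain x <- (1 - eta) * x + eta * theta. Truncating the
# reverse pass to the last K steps preserves the gradient's sign but shrinks
# its magnitude, which is the full-vs-approximation trade-off at issue.

def grad_through_last_k(g_x0, eta, K):
    """Reverse pass restricted to the final K sampling steps."""
    g_theta, g_x = 0.0, g_x0
    for _ in range(K):
        g_theta += g_x * eta    # contribution of this step's parameters
        g_x *= (1 - eta)        # attenuation toward earlier steps
    return g_theta

eta, T = 0.1, 50
g_x0 = 1.0                                 # d(reward)/d(x0), taken as given
full = grad_through_last_k(g_x0, eta, T)   # full-chain gradient
trunc = grad_through_last_k(g_x0, eta, 1)  # DRaFT-1 style gradient
assert 0 < trunc < full                    # same sign, smaller magnitude
assert abs(trunc - eta) < 1e-12            # K=1 keeps only the last step's term
```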
minor comments (2)
- [Abstract] The abstract states that the method works 'for a variety of reward functions' without enumerating them; adding a short list would improve readability.
- [§3] Notation for the composite Jacobian through the denoising chain and the exact form of the low-variance estimator in DRaFT-LV could be stated more explicitly to aid reproducibility.
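On the DRaFT-LV point, the generic variance-reduction mechanism can be sketched without reproducing the paper's exact estimator: averaging n independent single-step gradient estimates divides the estimator variance by roughly n. The noise model and `noise_scale` below are illustrative assumptions standing in for the stochasticity of re-noising the final sample.

```python
import random

# Illustrative sketch (not the paper's exact DRaFT-LV estimator): averaging
# several noisy single-step gradient estimates, each drawn with independent
# noise, reduces the variance of the K=1 gradient estimate.

def one_step_grad_estimate(rng, true_grad=1.0, noise_scale=0.5):
    """A noisy single-step gradient estimate."""
    return true_grad + rng.gauss(0.0, noise_scale)

def lv_estimate(rng, n):
    """Average n independent single-step estimates."""
    return sum(one_step_grad_estimate(rng) for _ in range(n)) / n

def empirical_var(samples):
    m = sum(samples) / len(samples)
    return sum((s - m) ** 2 for s in samples) / len(samples)

rng = random.Random(0)
single = [lv_estimate(rng, 1) for _ in range(5000)]
averaged = [lv_estimate(rng, 4) for _ in range(5000)]
assert empirical_var(averaged) < empirical_var(single) / 2  # roughly 4x lower
```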
Simulated Author's Rebuttal
We thank the referee for their constructive feedback on our manuscript. We address each major comment below and commit to incorporating additional analyses and experiments in the revised version to strengthen the paper.
point-by-point responses
-
Referee: [§3] §3 (DRaFT description): the assertion that backpropagation through the full sampling chain (typically 50–1000 steps) is both feasible and yields strong performance is load-bearing for the headline claim, yet no per-step gradient norm statistics, vanishing/exploding analysis, or ablation on total denoising steps for the non-truncated case are reported. The introduction of DRaFT-K/DRaFT-LV precisely to address stability and memory issues leaves open whether the reported gains derive from the full procedure or the approximations.
Authors: We thank the referee for highlighting this important aspect. Our experiments do demonstrate that full backpropagation through the sampling chain is feasible and leads to strong performance, as shown in the main results where DRaFT outperforms baselines. However, we agree that additional diagnostics would be valuable. In the revision, we will add per-step gradient norm statistics to analyze potential vanishing or exploding gradients, and include an ablation on the number of denoising steps for the full DRaFT. We will also clarify that the primary results are from the full procedure, with truncated variants presented as efficient alternatives. (revision: yes)
-
Referee: [§4] §4 (Experiments): outperformance over RL baselines is claimed across rewards, but the results lack error bars, multiple random seeds, or statistical significance tests. This undermines confidence that the gains are robust rather than attributable to sampling variance, especially given the differentiability and gradient-flow assumptions.
Authors: We agree that reporting error bars and using multiple seeds would increase confidence in the results. We will rerun the main experiments with at least three random seeds and report means with standard deviations as error bars. Additionally, we will perform statistical significance tests (e.g., t-tests) to validate the outperformance over RL baselines. (revision: yes)
-
Referee: [§3.1] §3.1 (Method): the memory footprint and wall-clock cost of full backpropagation versus the truncated variants are not quantified, which is critical for assessing the practicality of the non-truncated DRaFT that the abstract presents as the primary result.
Authors: We acknowledge that quantifying the computational costs is crucial for practicality. In the revised manuscript, we will include measurements of peak memory usage and wall-clock time for training with full DRaFT compared to the DRaFT-K and DRaFT-LV variants, using the experimental setup described in the paper. (revision: yes)
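A back-of-envelope sketch of the memory trade-off at issue, under the common assumption that gradient checkpointing with sqrt(T)-spaced checkpoints is used through the sampling chain (the counting model is illustrative, not a measurement):

```python
import math

# Activation-storage accounting (illustrative) for backprop through a T-step
# sampling chain: plain backprop stores every intermediate state, while
# gradient checkpointing stores only sqrt(T)-spaced checkpoints and recomputes
# each segment during the backward pass, trading compute for memory.

def stored_states_plain(T):
    return T  # one stored activation per sampling step

def stored_states_checkpointed(T):
    seg = max(1, math.isqrt(T))        # checkpoint every ~sqrt(T) steps
    n_checkpoints = math.ceil(T / seg)
    return n_checkpoints + seg         # checkpoints plus one live segment

T = 50
assert stored_states_checkpointed(T) < stored_states_plain(T)
# e.g. T = 50: plain backprop stores 50 states; checkpointing stores about 15,
# at the cost of roughly one extra forward pass of recomputation.
```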
Circularity Check
No circularity: direct application of backpropagation through sampling
full rationale
The paper's core contribution is the DRaFT method, which applies standard automatic differentiation to propagate gradients from a differentiable reward through the multi-step diffusion sampling chain. This is a direct computational procedure rather than a mathematical derivation that reduces to its own inputs by construction. Performance results are presented as empirical outcomes from training and evaluation on specific rewards, not as predictions forced by fitting parameters to the target metric itself. The truncated variants DRaFT-K and DRaFT-LV are introduced for practical efficiency and variance reduction but do not create self-referential loops in the central claim. No load-bearing self-citations, uniqueness theorems, or ansatzes imported from prior author work are invoked to justify the feasibility or superiority of full backpropagation; the approach remains self-contained against external benchmarks such as RL baselines.
Axiom & Free-Parameter Ledger
free parameters (1)
- K (truncation steps)
axioms (1)
- domain assumption: the diffusion sampling process is differentiable with respect to model parameters.
Forward citations
Cited by 20 Pith papers
-
Flow-GRPO: Training Flow Matching Models via Online RL
Flow-GRPO is the first online RL method for flow matching models, raising GenEval accuracy from 63% to 95% and text-rendering accuracy from 59% to 92% with little reward hacking.
-
Offline Preference Optimization for Rectified Flow with Noise-Tracked Pairs
PNAPO augments preference data with prior noise pairs and uses straight-line interpolation to create a tighter surrogate objective for offline alignment of rectified flow models.
-
Guidance Is Not a Hyperparameter: Learning Dynamic Control in Diffusion Language Models
Adaptive guidance trajectories learned via PPO outperform fixed-scale CFG on controllability-quality balance in three controlled NLP generation tasks with discrete diffusion models.
-
Improved techniques for fine-tuning flow models via adjoint matching: a deterministic control pipeline
A new adjoint matching framework formulates flow model alignment as optimal control, enabling direct regression training and terminal-trajectory truncation for efficiency gains on models like SiT-XL and FLUX.
-
How to Guide Your Flow: Few-Step Alignment via Flow Map Reward Guidance
FMRG is a training-free, single-trajectory guidance method for flow models derived from optimal control that achieves strong reward alignment with only 3 NFEs.
-
Long-Text-to-Image Generation via Compositional Prompt Decomposition
PRISM lets pre-trained text-to-image models handle long prompts by breaking them into compositional parts, predicting noise separately, and merging outputs via energy-based conjunction, matching fine-tuned models whil...
-
LeapAlign: Post-Training Flow Matching Models at Any Generation Step by Building Two-Step Trajectories
LeapAlign fine-tunes flow matching models by constructing two consecutive leaps that skip multiple ODE steps with randomized timesteps and consistency weighting, enabling stable updates at any generation step.
-
Step-level Denoising-time Diffusion Alignment with Multiple Objectives
MSDDA derives a closed-form optimal reverse denoising distribution for multi-objective diffusion alignment that is exactly equivalent to step-level RL fine-tuning with no approximation error.
-
VASR: Variance-Aware Systematic Resampling for Reward-Guided Diffusion
FVD applies Fleming-Viot population dynamics to diffusion model sampling at inference time to reduce diversity collapse while improving reward alignment and FID scores.
-
Improving Text-to-Image Generation with Intrinsic Self-Confidence Rewards
SOLACE improves text-to-image generation by using intrinsic self-confidence rewards from noise reconstruction accuracy during reinforcement learning post-training without external supervision.
-
Bird-SR: Bidirectional Reward-Guided Diffusion for Real-World Image Super-Resolution
Bird-SR outperforms prior super-resolution methods on real images by guiding diffusion trajectories with bidirectional rewards, early structure optimization on synthetic pairs, and later perceptual rewards with dynami...
-
CRAFT: Clinical Reward-Aligned Finetuning for Medical Image Synthesis
CRAFT adapts diffusion models to medical images via clinical reward alignment from LLMs and VLMs, improving alignment scores and cutting low-quality generations by 20.4% on average across modalities.
-
Power Reinforcement Post-Training of Text-to-Image Models with Super-Linear Advantage Shaping
Super-Linear Advantage Shaping (SLAS) introduces a non-linear geometric policy update for RL post-training of text-to-image models that reshapes the local policy space via advantage-dependent Fisher-Rao weighting to r...
-
Response Time Enhances Alignment with Heterogeneous Preferences
Response times modeled as drift-diffusion processes enable consistent estimation of population-average preferences from heterogeneous anonymous binary choices.
-
Bias at the End of the Score
Reward models used as quality scorers in text-to-image generation encode demographic biases that cause reward-guided training to sexualize female subjects, reinforce stereotypes, and reduce diversity.
-
FP4 Explore, BF16 Train: Diffusion Reinforcement Learning via Efficient Rollout Scaling
Sol-RL decouples FP4-based candidate exploration from BF16 policy optimization in diffusion RL, delivering up to 4.64x faster convergence with maintained or superior alignment performance on models like FLUX.1 and SD3.5.
-
VASR: Variance-Aware Systematic Resampling for Reward-Guided Diffusion
VASR separates continuation and residual variance in reward-guided diffusion SMC, using optimal mass allocation and systematic resampling to achieve up to 26% better FID scores and faster runtimes than prior SMC and M...
-
Improving Video Generation with Human Feedback
A human preference dataset and VideoReward model enable Flow-DPO and Flow-NRG to produce smoother, better-aligned videos from text prompts in flow-based generators.
-
Diffusion Policy Policy Optimization
DPPO fine-tunes diffusion policies via policy gradients and outperforms prior RL approaches for diffusion policies and PG-tuned alternatives on robot benchmarks while enabling stable training and hardware deployment.
-
Towards General Preference Alignment: Diffusion Models at Nash Equilibrium
Diff.-NPO frames diffusion alignment as a self-play game reaching Nash equilibrium and reports better text-to-image results than prior DPO-style methods.
Reference graph
Works this paper leans on
-
[1]
A General Language Assistant as a Laboratory for Alignment
Amanda Askell, Yuntao Bai, Anna Chen, Dawn Drain, Deep Ganguli, Tom Henighan, Andy Jones, Nicholas Joseph, Ben Mann, Nova DasSarma, et al. A general language assistant as a laboratory for alignment. arXiv preprint arXiv:2112.00861.
-
[2]
Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback
Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, et al. Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862, 2022.
-
[3]
Training Diffusion Models with Reinforcement Learning
Kevin Black, Michael Janner, Yilun Du, Ilya Kostrikov, and Sergey Levine. Training diffusion models with reinforcement learning. arXiv preprint arXiv:2305.13301.
-
[4]
Training Deep Nets with Sublinear Memory Cost
Tianqi Chen, Bing Xu, Chiyuan Zhang, and Carlos Guestrin. Training deep nets with sublinear memory cost. arXiv preprint arXiv:1604.06174.
-
[5]
Microsoft COCO Captions: Data Collection and Evaluation Server
Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollár, and C Lawrence Zitnick. Microsoft COCO captions: Data collection and evaluation server. arXiv preprint arXiv:1504.00325.
-
[6]
RAFT: Reward ranked finetuning for generative foundation model alignment
Hanze Dong, Wei Xiong, Deepanshu Goyal, Rui Pan, Shizhe Diao, Jipeng Zhang, Kashun Shum, and Tong Zhang. RAFT: Reward ranked finetuning for generative foundation model alignment. arXiv preprint arXiv:2304.06767.
-
[7]
Optimizing DDPM Sampling with Shortcut Fine-Tuning
Ying Fan and Kangwook Lee. Optimizing DDPM sampling with shortcut fine-tuning. In International Conference on Machine Learning (ICML).
-
[8]
Improving alignment of dialogue agents via targeted human judgements
Amelia Glaese, Nat McAleese, Maja Trębacz, John Aslanides, Vlad Firoiu, Timo Ewalds, Maribeth Rauh, Laura Weidinger, Martin Chadwick, Phoebe Thacker, et al. Improving alignment of dialogue agents via targeted human judgements. arXiv preprint arXiv:2209.14375.
-
[9]
Optimizing prompts for text-to-image generation
Yaru Hao, Zewen Chi, Li Dong, and Furu Wei. Optimizing prompts for text-to-image generation. arXiv preprint arXiv:2212.09611.
-
[10]
Imagen Video: High Definition Video Generation with Diffusion Models
Jonathan Ho, William Chan, Chitwan Saharia, Jay Whang, Ruiqi Gao, Alexey Gritsenko, Diederik P Kingma, Ben Poole, Mohammad Norouzi, David J Fleet, et al. Imagen video: High definition video generation with diffusion models. arXiv preprint arXiv:2210.02303.
-
[11]
Pick-a-pic: An open dataset of user preferences for text-to-image generation
Yuval Kirstain, Adam Polyak, Uriel Singer, Shahbuland Matiana, Joe Penna, and Omer Levy. Pick-a-pic: An open dataset of user preferences for text-to-image generation. arXiv preprint arXiv:2305.01569.
-
[12]
Aligning Text-to-Image Models using Human Feedback
Kimin Lee, Hao Liu, Moonkyung Ryu, Olivia Watkins, Yuqing Du, Craig Boutilier, Pieter Abbeel, Mohammad Ghavamzadeh, and Shixiang Shane Gu. Aligning text-to-image models using human feedback. arXiv preprint arXiv:2302.12192.
-
[13]
UPainting: Unified text-to-image diffusion generation with cross-modal guidance
Wei Li, Xue Xu, Xinyan Xiao, Jiachen Liu, Hu Yang, Guohao Li, Zhanpeng Wang, Zhifan Feng, Qiaoqiao She, Yajuan Lyu, et al. UPainting: Unified text-to-image diffusion generation with cross-modal guidance. arXiv preprint arXiv:2210.16031.
-
[14]
Continuous control with deep reinforcement learning
Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. In International Conference on Learning Representations.
-
[15]
GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models
Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. GLIDE: Towards photorealistic image generation and editing with text-guided diffusion models. arXiv preprint arXiv:2112.10741.
-
[16]
Hierarchical Text-Conditional Image Generation with CLIP Latents
Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with CLIP latents. arXiv preprint arXiv:2204.06125.
-
[17]
Photorealistic text-to-image diffusion models with deep language understanding
Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily Denton, Seyed Kamyar Seyed Ghasemipour, Burcu Karagol Ayan, S Sara Mahdavi, Rapha Gontijo Lopes, et al. Photorealistic text-to-image diffusion models with deep language understanding. Advances in Neural Information Processing Systems.
-
[18]
End-to-End Diffusion Latent Optimization Improves Classifier Guidance
Bram Wallace, Akash Gokul, Stefano Ermon, and Nikhil Naik. End-to-end diffusion latent optimization improves classifier guidance. arXiv preprint arXiv:2303.13703.
-
[19]
Recursively Summarizing Books with Human Feedback
Jeff Wu, Long Ouyang, Daniel M Ziegler, Nisan Stiennon, Ryan Lowe, Jan Leike, and Paul Christiano. Recursively summarizing books with human feedback. arXiv preprint arXiv:2109.10862.
-
[20]
Human Preference Score v2: A Solid Benchmark for Evaluating Human Preferences of Text-to-Image Synthesis
Xiaoshi Wu, Yiming Hao, Keqiang Sun, Yixiong Chen, Feng Zhu, Rui Zhao, and Hongsheng Li. Human preference score v2: A solid benchmark for evaluating human preferences of text-to-image synthesis. arXiv preprint arXiv:2306.09341, 2023.
-
[21]
Fine-Tuning Language Models from Human Preferences
Daniel M Ziegler, Nisan Stiennon, Jeffrey Wu, Tom B Brown, Alec Radford, Dario Amodei, Paul Christiano, and Geoffrey Irving. Fine-tuning language models from human preferences. arXiv preprint arXiv:1909.08593.