pith. machine review for the scientific record.

arxiv: 2309.17400 · v2 · submitted 2023-09-29 · 💻 cs.CV · cs.LG

Recognition: no theorem link

Directly Fine-Tuning Diffusion Models on Differentiable Rewards

Authors on Pith: no claims yet

Pith reviewed 2026-05-16 09:07 UTC · model grok-4.3

classification: 💻 cs.CV · cs.LG
keywords: diffusion models · fine-tuning · differentiable rewards · gradient backpropagation · image generation · DRaFT · reinforcement learning

The pith

Diffusion models can be fine-tuned directly on differentiable rewards by backpropagating gradients through the full sampling process.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that gradients from a reward function can be propagated backward through every step of a diffusion model's denoising chain to update the model parameters. This direct method, called DRaFT, delivers stronger results than reinforcement-learning fine-tuning on rewards such as aesthetic scores and human-preference models. Truncated variants limit backpropagation to the final K steps, and a low-variance estimator handles the single-step case, keeping computation practical. The approach matters because it removes the need for separate RL policy optimization while still aligning generated images with desired properties.

Core claim

Direct Reward Fine-Tuning (DRaFT) enables fine-tuning diffusion models to maximize differentiable reward functions by backpropagating the reward gradient through the full sampling procedure. DRaFT-K truncates backpropagation to the last K steps, and DRaFT-LV supplies lower-variance gradient estimates when K equals 1. The methods achieve strong performance across reward functions and substantially improve the aesthetic quality of images produced by Stable Diffusion 1.4, while also unifying prior gradient-based fine-tuning designs.
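
The DRaFT-LV estimator is only named here, not specified, so the sketch below is a hedged reading of a low-variance K=1 scheme: average the single-step reward gradient over n independent re-noisings of the generated sample. Every name in it (`denoiser`, `reward`, the noise scale `sigma`, the one-step update) is an illustrative stand-in, matching the toy pieces in the sketch under "What carries the argument" below.

```python
# Hedged sketch of a DRaFT-LV-style low-variance estimator for K = 1:
# re-noise the generated sample n times, take one gradient-carrying
# denoising step each time, and average the reward losses. All pieces
# here are toy stand-ins, not the paper's implementation.
import torch

def draft_lv_loss(x0, denoiser, reward, n=2, sigma=0.1, T=50):
    x0 = x0.detach()  # no gradients flow into the earlier sampling steps
    losses = []
    for _ in range(n):
        x1 = x0 + sigma * torch.randn_like(x0)        # re-noise the sample
        t_embed = torch.full((x1.shape[0], 1), 1.0 / T)
        eps = denoiser(torch.cat([x1, t_embed], dim=-1))
        x0_hat = x1 - eps / T                         # one step with grads
        losses.append(-reward(x0_hat))
    return torch.stack(losses).mean()  # averaging n draws lowers variance
```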

What carries the argument

Backpropagating the reward-function gradient through the iterative denoising sampling procedure of the diffusion model.
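
A minimal sketch of this mechanism, under loudly stated assumptions: a tiny MLP stands in for Stable Diffusion's UNet, a hand-rolled quadratic stands in for a differentiable reward model, and a crude deterministic update stands in for DDIM. The `k` argument shows how DRaFT-K truncation falls out of the same loop (k = T recovers full DRaFT).

```python
# Toy DRaFT training loop: backpropagate the reward gradient through the
# whole sampling chain. Everything here is an illustrative stand-in.
import torch
import torch.nn as nn

T = 50  # number of denoising steps

denoiser = nn.Sequential(nn.Linear(17, 64), nn.SiLU(), nn.Linear(64, 16))
opt = torch.optim.Adam(denoiser.parameters(), lr=1e-4)

def reward(x):
    # Placeholder for a differentiable scorer such as an aesthetic model.
    return -(x - 1.0).pow(2).mean()

def sample(x, k=T):
    # Run T denoising steps; gradients flow through only the last k of them
    # (k = T is full DRaFT, k < T is DRaFT-K).
    for t in range(T, 0, -1):
        if t == k:
            x = x.detach()  # truncation point for DRaFT-K
        t_embed = torch.full((x.shape[0], 1), t / T)
        eps = denoiser(torch.cat([x, t_embed], dim=-1))
        x = x - eps / T  # crude stand-in for a DDIM update
    return x

for step in range(100):
    x0 = sample(torch.randn(8, 16), k=T)  # full-chain sampling
    loss = -reward(x0)                    # maximize the reward
    opt.zero_grad()
    loss.backward()                       # backprop through every step
    opt.step()
```

At the paper's scale this only becomes feasible with gradient checkpointing and LoRA adapters, neither of which is modeled in the sketch.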

If this is right

  • Strong performance is obtained on multiple reward functions including human preference models.
  • The approach outperforms reinforcement-learning-based fine-tuning methods.
  • Aesthetic quality of images generated by Stable Diffusion 1.4 improves substantially.
  • Truncated and low-variance variants make full backpropagation computationally feasible.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same gradient-through-sampling idea could apply to other iterative generative models whose sampling chains admit differentiation.
  • Multiple differentiable rewards could be combined directly for multi-objective tuning without separate weighting schemes.
  • Truncation to final steps may transfer to other long-horizon sampling or optimization problems beyond diffusion.

Load-bearing premise

The reward function must be differentiable with respect to the generated samples so that gradients can flow back through the sampling steps.
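
This premise is cheap to smoke-test for any candidate scorer before training; the `reward` below is a toy stand-in, and a non-differentiable scorer would raise inside `torch.autograd.grad`.

```python
# Sanity check: does the reward admit finite gradients w.r.t. the sample?
import torch

def reward(x):
    return -(x - 1.0).pow(2).mean()  # toy stand-in for the scorer under test

x = torch.randn(4, 16, requires_grad=True)
(g,) = torch.autograd.grad(reward(x), x)
assert torch.isfinite(g).all(), "reward gradient is non-finite"
```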

What would settle it

Training diverges or shows no reward improvement over the base model when gradients are backpropagated through sampling on a held-out evaluation set.

Original abstract

We present Direct Reward Fine-Tuning (DRaFT), a simple and effective method for fine-tuning diffusion models to maximize differentiable reward functions, such as scores from human preference models. We first show that it is possible to backpropagate the reward function gradient through the full sampling procedure, and that doing so achieves strong performance on a variety of rewards, outperforming reinforcement learning-based approaches. We then propose more efficient variants of DRaFT: DRaFT-K, which truncates backpropagation to only the last K steps of sampling, and DRaFT-LV, which obtains lower-variance gradient estimates for the case when K=1. We show that our methods work well for a variety of reward functions and can be used to substantially improve the aesthetic quality of images generated by Stable Diffusion 1.4. Finally, we draw connections between our approach and prior work, providing a unifying perspective on the design space of gradient-based fine-tuning algorithms.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces Direct Reward Fine-Tuning (DRaFT), a method to fine-tune diffusion models by backpropagating gradients from differentiable reward functions through the full denoising sampling trajectory. It reports that this achieves strong performance, outperforming RL-based baselines on multiple rewards; proposes truncated variants DRaFT-K (last K steps) and DRaFT-LV (a low-variance estimator for K=1) for efficiency; demonstrates aesthetic improvements on Stable Diffusion 1.4; and unifies the approach with prior gradient-based fine-tuning methods.

Significance. If the central claims hold, the work offers a direct, non-RL alternative for reward optimization in diffusion models that could simplify pipelines while delivering measurable gains in image quality. The unification perspective on the design space of gradient-based methods adds conceptual value, and the empirical results on Stable Diffusion indicate practical utility for preference alignment tasks.

major comments (3)
  1. [§3] (DRaFT description): the assertion that backpropagation through the full sampling chain (typically 50–1000 steps) is both feasible and yields strong performance is load-bearing for the headline claim, yet no per-step gradient norm statistics, vanishing/exploding analysis, or ablation on total denoising steps for the non-truncated case are reported. The introduction of DRaFT-K/DRaFT-LV precisely to address stability and memory issues leaves open whether the reported gains derive from the full procedure or the approximations.
  2. [§4] (Experiments): outperformance over RL baselines is claimed across rewards, but the results lack error bars, multiple random seeds, or statistical significance tests. This undermines confidence that the gains are robust rather than attributable to sampling variance, especially given the differentiability and gradient-flow assumptions.
  3. [§3.1] (Method): the memory footprint and wall-clock cost of full backpropagation versus the truncated variants are not quantified, which is critical for assessing the practicality of the non-truncated DRaFT that the abstract presents as the primary result.
minor comments (2)
  1. [Abstract] The abstract states that the method works 'for a variety of reward functions' without enumerating them; adding a short list would improve readability.
  2. [§3] Notation for the composite Jacobian through the denoising chain and the exact form of the low-variance estimator in DRaFT-LV could be stated more explicitly to aid reproducibility.

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive feedback on our manuscript. We address each major comment below and commit to incorporating additional analyses and experiments in the revised version to strengthen the paper.

point-by-point responses
  1. Referee: [§3] (DRaFT description): the assertion that backpropagation through the full sampling chain (typically 50–1000 steps) is both feasible and yields strong performance is load-bearing for the headline claim, yet no per-step gradient norm statistics, vanishing/exploding analysis, or ablation on total denoising steps for the non-truncated case are reported. The introduction of DRaFT-K/DRaFT-LV precisely to address stability and memory issues leaves open whether the reported gains derive from the full procedure or the approximations.

    Authors: We thank the referee for highlighting this important aspect. Our experiments do demonstrate that full backpropagation through the sampling chain is feasible and leads to strong performance, as shown in the main results where DRaFT outperforms baselines. However, we agree that additional diagnostics would be valuable. In the revision, we will add per-step gradient norm statistics to analyze potential vanishing or exploding gradients, and include an ablation on the number of denoising steps for the full DRaFT. We will also clarify that the primary results are from the full procedure, with truncated variants presented as efficient alternatives; a sketch of such a gradient-norm diagnostic follows this response list. revision: yes

  2. Referee: [§4] (Experiments): outperformance over RL baselines is claimed across rewards, but the results lack error bars, multiple random seeds, or statistical significance tests. This undermines confidence that the gains are robust rather than attributable to sampling variance, especially given the differentiability and gradient-flow assumptions.

    Authors: We agree that reporting error bars and using multiple seeds would increase confidence in the results. We will rerun the main experiments with at least three random seeds and report means with standard deviations as error bars. Additionally, we will perform statistical significance tests (e.g., t-tests) to validate the outperformance over RL baselines; a sketch of this analysis follows the response list. revision: yes

  3. Referee: [§3.1] (Method): the memory footprint and wall-clock cost of full backpropagation versus the truncated variants are not quantified, which is critical for assessing the practicality of the non-truncated DRaFT that the abstract presents as the primary result.

    Authors: We acknowledge that quantifying the computational costs is crucial for practicality. In the revised manuscript, we will include measurements of peak memory usage and wall-clock time for training with full DRaFT compared to the DRaFT-K and DRaFT-LV variants, using the experimental setup described in the paper; a measurement sketch follows the response list. revision: yes
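
The three commitments above are all cheap to prototype. First, per-step gradient norms through the chain: retain gradients on each intermediate latent and read them off after one backward pass. The sketch reuses the toy `denoiser` and `reward` stand-ins from the DRaFT sketch earlier on this page, not the authors' actual setup.

```python
# Per-step gradient-norm diagnostic for vanishing/exploding gradients.
import torch

def per_step_grad_norms(denoiser, reward, T=50, batch=8, dim=16):
    x = torch.randn(batch, dim)
    latents = []
    for t in range(T, 0, -1):
        t_embed = torch.full((batch, 1), t / T)
        x = x - denoiser(torch.cat([x, t_embed], dim=-1)) / T
        x.retain_grad()  # keep the loss gradient at this latent
        latents.append(x)
    (-reward(x)).backward()
    return [lat.grad.norm().item() for lat in latents]  # first step to last
```

Second, the seed-robustness check: means with standard deviations and a Welch t-test against an RL baseline. The reward values below are hypothetical placeholders, not numbers from the paper.

```python
# Mean ± std over seeds plus a Welch t-test (placeholder data).
import numpy as np
from scipy import stats

draft_rewards = np.array([7.91, 7.88, 7.95])  # hypothetical, one per seed
rl_rewards = np.array([7.42, 7.51, 7.38])     # hypothetical, one per seed

print(f"DRaFT {draft_rewards.mean():.2f} ± {draft_rewards.std(ddof=1):.2f}")
print(f"RL    {rl_rewards.mean():.2f} ± {rl_rewards.std(ddof=1):.2f}")
t, p = stats.ttest_ind(draft_rewards, rl_rewards, equal_var=False)
print(f"Welch t-test: t={t:.2f}, p={p:.4f}")
```

Third, the cost measurement: peak GPU memory and wall-clock time per step. `train_step(k)` is a hypothetical hook standing in for one optimizer step of the DRaFT sketch with truncation length k; the absolute numbers depend entirely on the real model and hardware.

```python
# Peak memory and wall-clock time, full DRaFT (k = T) vs. DRaFT-1 (k = 1).
import time
import torch

def measure(train_step, k):
    torch.cuda.reset_peak_memory_stats()
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    train_step(k)  # hypothetical: one training step with truncation length k
    torch.cuda.synchronize()
    return time.perf_counter() - t0, torch.cuda.max_memory_allocated() / 1e9

for k in (50, 1):
    sec, gb = measure(train_step, k)
    print(f"k={k}: {sec:.2f} s/step, {gb:.2f} GB peak")
```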

Circularity Check

0 steps flagged

No circularity: direct application of backpropagation through sampling

full rationale

The paper's core contribution is the DRaFT method, which applies standard automatic differentiation to propagate gradients from a differentiable reward through the multi-step diffusion sampling chain. This is a direct computational procedure rather than a mathematical derivation that reduces to its own inputs by construction. Performance results are presented as empirical outcomes from training and evaluation on specific rewards, not as predictions forced by fitting parameters to the target metric itself. The truncated variants DRaFT-K and DRaFT-LV are introduced for practical efficiency and variance reduction but do not create self-referential loops in the central claim. No load-bearing self-citations, uniqueness theorems, or ansatzes imported from prior author work are invoked to justify the feasibility or superiority of full backpropagation; the approach remains self-contained against external benchmarks such as RL baselines.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The approach assumes differentiability of the reward and of the sampling chain; no new entities are introduced, and the free parameters appear limited to the truncation length K used by the variants.

free parameters (1)
  • K (truncation steps)
    Hyperparameter controlling how many final sampling steps receive gradients in DRaFT-K.
axioms (1)
  • domain assumption: The diffusion sampling process is differentiable with respect to model parameters.
    Required for backpropagation through the full chain; stated implicitly in the method description.

pith-pipeline@v0.9.0 · 5465 in / 1182 out tokens · 16455 ms · 2026-05-16T09:07:34.450537+00:00 · methodology


Forward citations

Cited by 20 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Flow-GRPO: Training Flow Matching Models via Online RL

    cs.CV 2025-05 unverdicted novelty 8.0

    Flow-GRPO is the first online RL method for flow matching models, raising GenEval accuracy from 63% to 95% and text-rendering accuracy from 59% to 92% with little reward hacking.

  2. Offline Preference Optimization for Rectified Flow with Noise-Tracked Pairs

    cs.CV 2026-05 unverdicted novelty 7.0

    PNAPO augments preference data with prior noise pairs and uses straight-line interpolation to create a tighter surrogate objective for offline alignment of rectified flow models.

  3. Guidance Is Not a Hyperparameter: Learning Dynamic Control in Diffusion Language Models

    cs.CL 2026-05 unverdicted novelty 7.0

    Adaptive guidance trajectories learned via PPO outperform fixed-scale CFG on controllability-quality balance in three controlled NLP generation tasks with discrete diffusion models.

  4. Improved techniques for fine-tuning flow models via adjoint matching: a deterministic control pipeline

    cs.AI 2026-05 unverdicted novelty 7.0

    A new adjoint matching framework formulates flow model alignment as optimal control, enabling direct regression training and terminal-trajectory truncation for efficiency gains on models like SiT-XL and FLUX.

  5. How to Guide Your Flow: Few-Step Alignment via Flow Map Reward Guidance

    cs.LG 2026-04 unverdicted novelty 7.0

    FMRG is a training-free, single-trajectory guidance method for flow models derived from optimal control that achieves strong reward alignment with only 3 NFEs.

  6. Long-Text-to-Image Generation via Compositional Prompt Decomposition

    cs.CV 2026-04 unverdicted novelty 7.0

    PRISM lets pre-trained text-to-image models handle long prompts by breaking them into compositional parts, predicting noise separately, and merging outputs via energy-based conjunction, matching fine-tuned models whil...

  7. LeapAlign: Post-Training Flow Matching Models at Any Generation Step by Building Two-Step Trajectories

    cs.CV 2026-04 unverdicted novelty 7.0

    LeapAlign fine-tunes flow matching models by constructing two consecutive leaps that skip multiple ODE steps with randomized timesteps and consistency weighting, enabling stable updates at any generation step.

  8. Step-level Denoising-time Diffusion Alignment with Multiple Objectives

    cs.LG 2026-04 unverdicted novelty 7.0

    MSDDA derives a closed-form optimal reverse denoising distribution for multi-objective diffusion alignment that is exactly equivalent to step-level RL fine-tuning with no approximation error.

  9. VASR: Variance-Aware Systematic Resampling for Reward-Guided Diffusion

    cs.AI 2026-04 unverdicted novelty 7.0

    FVD applies Fleming-Viot population dynamics to diffusion model sampling at inference time to reduce diversity collapse while improving reward alignment and FID scores.

  10. Improving Text-to-Image Generation with Intrinsic Self-Confidence Rewards

    cs.CV 2026-03 unverdicted novelty 7.0

    SOLACE improves text-to-image generation by using intrinsic self-confidence rewards from noise reconstruction accuracy during reinforcement learning post-training without external supervision.

  11. Bird-SR: Bidirectional Reward-Guided Diffusion for Real-World Image Super-Resolution

    cs.CV 2026-02 unverdicted novelty 7.0

    Bird-SR outperforms prior super-resolution methods on real images by guiding diffusion trajectories with bidirectional rewards, early structure optimization on synthetic pairs, and later perceptual rewards with dynami...

  12. CRAFT: Clinical Reward-Aligned Finetuning for Medical Image Synthesis

    cs.CV 2026-05 unverdicted novelty 6.0

    CRAFT adapts diffusion models to medical images via clinical reward alignment from LLMs and VLMs, improving alignment scores and cutting low-quality generations by 20.4% on average across modalities.

  13. Power Reinforcement Post-Training of Text-to-Image Models with Super-Linear Advantage Shaping

    cs.CV 2026-05 unverdicted novelty 6.0

    Super-Linear Advantage Shaping (SLAS) introduces a non-linear geometric policy update for RL post-training of text-to-image models that reshapes the local policy space via advantage-dependent Fisher-Rao weighting to r...

  14. Response Time Enhances Alignment with Heterogeneous Preferences

    cs.LG 2026-05 unverdicted novelty 6.0

    Response times modeled as drift-diffusion processes enable consistent estimation of population-average preferences from heterogeneous anonymous binary choices.

  15. Bias at the End of the Score

    cs.CV 2026-04 unverdicted novelty 6.0

    Reward models used as quality scorers in text-to-image generation encode demographic biases that cause reward-guided training to sexualize female subjects, reinforce stereotypes, and reduce diversity.

  16. FP4 Explore, BF16 Train: Diffusion Reinforcement Learning via Efficient Rollout Scaling

    cs.LG 2026-04 unverdicted novelty 6.0

    Sol-RL decouples FP4-based candidate exploration from BF16 policy optimization in diffusion RL, delivering up to 4.64x faster convergence with maintained or superior alignment performance on models like FLUX.1 and SD3.5.

  17. VASR: Variance-Aware Systematic Resampling for Reward-Guided Diffusion

    cs.AI 2026-04 unverdicted novelty 6.0

    VASR separates continuation and residual variance in reward-guided diffusion SMC, using optimal mass allocation and systematic resampling to achieve up to 26% better FID scores and faster runtimes than prior SMC and M...

  18. Improving Video Generation with Human Feedback

    cs.CV 2025-01 unverdicted novelty 6.0

    A human preference dataset and VideoReward model enable Flow-DPO and Flow-NRG to produce smoother, better-aligned videos from text prompts in flow-based generators.

  19. Diffusion Policy Policy Optimization

    cs.RO 2024-09 unverdicted novelty 6.0

    DPPO fine-tunes diffusion policies via policy gradients and outperforms prior RL approaches for diffusion policies and PG-tuned alternatives on robot benchmarks while enabling stable training and hardware deployment.

  20. Towards General Preference Alignment: Diffusion Models at Nash Equilibrium

    cs.LG 2026-05 unverdicted novelty 5.0

    Diff.-NPO frames diffusion alignment as a self-play game reaching Nash equilibrium and reports better text-to-image results than prior DPO-style methods.

Reference graph

Works this paper leans on

38 extracted references · 38 canonical work pages · cited by 19 Pith papers · 12 internal anchors

  1. [1] A General Language Assistant as a Laboratory for Alignment. Amanda Askell, Yuntao Bai, Anna Chen, Dawn Drain, Deep Ganguli, Tom Henighan, Andy Jones, Nicholas Joseph, Ben Mann, Nova DasSarma, et al. arXiv:2112.00861.
  2. [2] Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback. Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, et al. arXiv:2204.05862.
  3. [3] Training Diffusion Models with Reinforcement Learning. Kevin Black, Michael Janner, Yilun Du, Ilya Kostrikov, and Sergey Levine. arXiv:2305.13301.
  4. [4] Training Deep Nets with Sublinear Memory Cost. Tianqi Chen, Bing Xu, Chiyuan Zhang, and Carlos Guestrin. arXiv:1604.06174.
  5. [5] Microsoft COCO Captions: Data Collection and Evaluation Server. Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollár, and C. Lawrence Zitnick. arXiv:1504.00325.
  6. [6] RAFT: Reward Ranked Finetuning for Generative Foundation Model Alignment. Hanze Dong, Wei Xiong, Deepanshu Goyal, Rui Pan, Shizhe Diao, Jipeng Zhang, Kashun Shum, and Tong Zhang. arXiv:2304.06767.
  7. [7] Optimizing DDPM Sampling with Shortcut Fine-Tuning. Ying Fan and Kangwook Lee. International Conference on Machine Learning (ICML).
  8. [8] Improving Alignment of Dialogue Agents via Targeted Human Judgements. Amelia Glaese, Nat McAleese, Maja Trębacz, John Aslanides, Vlad Firoiu, Timo Ewalds, Maribeth Rauh, Laura Weidinger, Martin Chadwick, Phoebe Thacker, et al. arXiv:2209.14375.
  9. [9] Optimizing Prompts for Text-to-Image Generation. Yaru Hao, Zewen Chi, Li Dong, and Furu Wei. arXiv:2212.09611.
  10. [10] Imagen Video: High Definition Video Generation with Diffusion Models. Jonathan Ho, William Chan, Chitwan Saharia, Jay Whang, Ruiqi Gao, Alexey Gritsenko, Diederik P. Kingma, Ben Poole, Mohammad Norouzi, David J. Fleet, et al. arXiv:2210.02303.
  11. [11] Pick-a-Pic: An Open Dataset of User Preferences for Text-to-Image Generation. Yuval Kirstain, Adam Polyak, Uriel Singer, Shahbuland Matiana, Joe Penna, and Omer Levy. arXiv:2305.01569.
  12. [12] Aligning Text-to-Image Models Using Human Feedback. Kimin Lee, Hao Liu, Moonkyung Ryu, Olivia Watkins, Yuqing Du, Craig Boutilier, Pieter Abbeel, Mohammad Ghavamzadeh, and Shixiang Shane Gu. arXiv:2302.12192.
  13. [13] UPainting: Unified Text-to-Image Diffusion Generation with Cross-Modal Guidance. Wei Li, Xue Xu, Xinyan Xiao, Jiachen Liu, Hu Yang, Guohao Li, Zhanpeng Wang, Zhifan Feng, Qiaoqiao She, Yajuan Lyu, et al. arXiv:2210.16031.
  14. [14] Continuous Control with Deep Reinforcement Learning. Timothy P. Lillicrap, Jonathan J. Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. International Conference on Learning Representations (ICLR).
  15. [15] GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models. Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. arXiv:2112.10741.
  16. [16] Hierarchical Text-Conditional Image Generation with CLIP Latents. Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. arXiv:2204.06125.
  17. [17] Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding. Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily Denton, Seyed Kamyar Seyed Ghasemipour, Burcu Karagol Ayan, S. Sara Mahdavi, Rapha Gontijo Lopes, et al. Advances in Neural Information Processing Systems.
  18. [18] End-to-End Diffusion Latent Optimization Improves Classifier Guidance. Bram Wallace, Akash Gokul, Stefano Ermon, and Nikhil Naik. arXiv:2303.13703.
  19. [19] Recursively Summarizing Books with Human Feedback. Jeff Wu, Long Ouyang, Daniel M. Ziegler, Nisan Stiennon, Ryan Lowe, Jan Leike, and Paul Christiano. arXiv:2109.10862.
  20. [20] Human Preference Score v2: A Solid Benchmark for Evaluating Human Preferences of Text-to-Image Synthesis. Xiaoshi Wu, Yiming Hao, Keqiang Sun, Yixiong Chen, Feng Zhu, Rui Zhao, and Hongsheng Li. arXiv:2306.09341.
  21. [21] Fine-Tuning Language Models from Human Preferences. Daniel M. Ziegler, Nisan Stiennon, Jeffrey Wu, Tom B. Brown, Alec Radford, Dario Amodei, Paul Christiano, and Geoffrey Irving. arXiv:1909.08593.

    of sampling, where K is deterministic; in (Xu et al., 2023), the authors randomly choose an iteration between a min and max step of the sampling chain (which incurs more hyperparameters) from which to predict the clean image. Also, because DRaFT runs the full sampling chain, our reward functions are always evaluated on final generations. In contrast, ReFL...