{"paper":{"title":"Directly Fine-Tuning Diffusion Models on Differentiable Rewards","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"Diffusion models can be fine-tuned directly on differentiable rewards by backpropagating gradients through the full sampling process.","cross_cats":["cs.LG"],"primary_cat":"cs.CV","authors_text":"David J Fleet, Kevin Clark, Kevin Swersky, Paul Vicol","submitted_at":"2023-09-29T17:01:02Z","abstract_excerpt":"We present Direct Reward Fine-Tuning (DRaFT), a simple and effective method for fine-tuning diffusion models to maximize differentiable reward functions, such as scores from human preference models. We first show that it is possible to backpropagate the reward function gradient through the full sampling procedure, and that doing so achieves strong performance on a variety of rewards, outperforming reinforcement learning-based approaches. We then propose more efficient variants of DRaFT: DRaFT-K, which truncates backpropagation to only the last K steps of sampling, and DRaFT-LV, which obtains l"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"it is possible to backpropagate the reward function gradient through the full sampling procedure, and that doing so achieves strong performance on a variety of rewards, outperforming reinforcement learning-based approaches.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"The reward function must be differentiable with respect to the generated samples, and the sampling process must allow stable gradient flow without excessive variance or memory issues.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"DRaFT fine-tunes diffusion models by differentiating through sampling to maximize rewards, outperforming RL baselines and improving aesthetics on Stable Diffusion 1.4.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"Diffusion models can be fine-tuned directly on differentiable rewards by backpropagating gradients through the full sampling process.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"76d1f39cd3f96955bacd3e92e14d5aa91b21053a72263177b9254b134498e511"},"source":{"id":"2309.17400","kind":"arxiv","version":2},"verdict":{"id":"3c502d07-8b04-4296-893c-d3ddb0fb3834","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-16T09:07:34.450537Z","strongest_claim":"it is possible to backpropagate the reward function gradient through the full sampling procedure, and that doing so achieves strong performance on a variety of rewards, outperforming reinforcement learning-based approaches.","one_line_summary":"DRaFT fine-tunes diffusion models by differentiating through sampling to maximize rewards, outperforming RL baselines and improving aesthetics on Stable Diffusion 1.4.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"The reward function must be differentiable with respect to the generated samples, and the sampling process must allow stable gradient flow without excessive variance or memory issues.","pith_extraction_headline":"Diffusion models can be fine-tuned directly on differentiable rewards by backpropagating gradients through the full sampling process."},"references":{"count":38,"sample":[{"doi":"","year":null,"title":"A General Language Assistant as a Laboratory for Alignment","work_id":"a43f9ea0-01be-47d5-b8ee-a1a9f73381c5","ref_index":1,"cited_arxiv_id":"2112.00861","is_internal_anchor":true},{"doi":"","year":null,"title":"Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback","work_id":"a1f2574b-a899-4713-be60-c87ba332656c","ref_index":2,"cited_arxiv_id":"2204.05862","is_internal_anchor":true},{"doi":"","year":null,"title":"Training Diffusion Models with Reinforcement Learning","work_id":"67684dda-3930-452a-b91a-36cbb8e2e219","ref_index":3,"cited_arxiv_id":"2305.13301","is_internal_anchor":true},{"doi":"","year":null,"title":"Training Deep Nets with Sublinear Memory Cost","work_id":"f2c5c287-a500-40e4-a136-e7e3172db1d7","ref_index":4,"cited_arxiv_id":"1604.06174","is_internal_anchor":true},{"doi":"","year":null,"title":"Microsoft COCO Captions: Data Collection and Evaluation Server","work_id":"b3d6fb46-4169-4a28-8f7e-2ca6774211da","ref_index":5,"cited_arxiv_id":"1504.00325","is_internal_anchor":true}],"resolved_work":38,"snapshot_sha256":"d5b7ffae4c3f191538de78d970ba5218df394fe688af195c965ef5a416ecc066","internal_anchors":13},"formal_canon":{"evidence_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"}