Training Diffusion Models with Reinforcement Learning

Ilya Kostrikov; Kevin Black; Michael Janner; Sergey Levine; Yilun Du

arXiv: 2305.13301 · v4 · submitted 2023-05-22 · 💻 cs.LG · cs.AI · cs.CV

Kevin Black, Michael Janner, Yilun Du, Ilya Kostrikov, Sergey Levine

Pith reviewed 2026-05-11 20:11 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CV

keywords: diffusion models, reinforcement learning, policy gradients, text-to-image generation, DDPO, human feedback, generative modeling, reward optimization

The pith

Diffusion models can be optimized directly for human feedback and practical objectives like compressibility by treating denoising as a multi-step decision process and applying policy gradients.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper demonstrates that diffusion models, usually trained only to approximate log-likelihood, can be adapted using reinforcement learning to target downstream goals that are hard to specify in prompts. Framing the iterative denoising steps as actions in a Markov decision process enables a family of policy gradient methods called DDPO. These methods outperform simpler reward-weighted likelihood training on tasks such as maximizing image compressibility and aesthetic quality scores from human preferences. A reader should care because the approach allows fine-tuning generative models on rewards from vision-language models without collecting new labeled data, directly improving alignment.

Core claim

By posing the denoising process as a multi-step decision-making problem, a class of policy gradient algorithms called denoising diffusion policy optimization (DDPO) can be used to directly optimize diffusion models for objectives such as image compressibility and aesthetic quality derived from human feedback, proving more effective than reward-weighted likelihood approaches. DDPO also improves prompt-image alignment when a vision-language model supplies the reward signal.
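For contrast with the policy gradient route, the following is a minimal sketch of what a reward-weighted likelihood objective can look like. It assumes exponentiated-reward weights with a temperature `beta`, which is one common variant and not necessarily the exact baseline used in the paper; the function names `reward_weights` and `weighted_nll` are hypothetical.

```python
import math


def reward_weights(rewards, beta=1.0):
    """Softmax-normalized exp(beta * r) weights (assumed weighting scheme)."""
    top = max(rewards)  # subtract the max for numerical stability
    unnorm = [math.exp(beta * (r - top)) for r in rewards]
    total = sum(unnorm)
    return [u / total for u in unnorm]


def weighted_nll(log_likelihoods, rewards, beta=1.0):
    """Reward-weighted negative log-likelihood over a batch of samples."""
    weights = reward_weights(rewards, beta)
    return -sum(w * ll for w, ll in zip(weights, log_likelihoods))
```

In this sketch the reward only rescales each sample's likelihood term, whereas the policy gradient view credits every denoising step of the trajectory that produced the sample.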

What carries the argument

Denoising diffusion policy optimization (DDPO), a policy gradient method that treats the full denoising trajectory as an MDP and updates the diffusion policy to maximize expected reward.
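This page does not reproduce the paper's equations, so the block below is only a sketch of the MDP framing and the resulting score-function (REINFORCE-style) gradient. The notation is assumed: c is the prompt, x_T, ..., x_0 is the denoising trajectory, p_θ is the learned reverse process, and r is the scalar reward on the final image.

```latex
% Assumed notation; a sketch, not a transcription of the paper's equations.
\begin{align*}
s_t &= (c,\, t,\, x_t), \qquad a_t = x_{t-1}, \qquad
\pi_\theta(a_t \mid s_t) = p_\theta(x_{t-1} \mid x_t, c), \\
R(s_t, a_t) &=
\begin{cases}
r(x_0, c) & \text{if the action produces } x_0 \ (t = 1), \\
0 & \text{otherwise},
\end{cases} \\
J(\theta) &= \mathbb{E}_{c,\; x_{0:T} \sim p_\theta(\cdot \mid c)}\big[\, r(x_0, c) \,\big], \\
\nabla_\theta J(\theta) &= \mathbb{E}\left[\, \sum_{t=1}^{T}
\nabla_\theta \log p_\theta(x_{t-1} \mid x_t, c)\; r(x_0, c) \right].
\end{align*}
```

Because each denoising step is a Gaussian with a tractable log-density, every term in the sum can be evaluated exactly, which is what makes the per-step policy gradient usable.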

If this is right

Text-to-image diffusion models can be fine-tuned to produce more compressible images without any change to the original training data or prompts.
Aesthetic quality can be directly maximized using scalar rewards from human raters or pretrained scorers.
Prompt-image alignment can be improved by using a fixed vision-language model to generate rewards, eliminating the need for additional human annotation.
The same policy-gradient machinery applies to any downstream objective that can be expressed as a scalar reward over generated images.
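To make the "scalar reward over generated images" idea concrete, here is a small sketch of a compressibility reward. This page does not say how compressibility is scored, so the sketch assumes the reward is the negative compressed size in kilobytes and uses `zlib` from the standard library as a stand-in for an image codec; `compressibility_reward` is a hypothetical name.

```python
import random
import zlib


def compressibility_reward(pixels: bytes) -> float:
    """Negative compressed size in kB: smaller encodings earn higher reward.

    Assumption: zlib stands in for an image codec such as JPEG.
    """
    return -len(zlib.compress(pixels, 9)) / 1000.0


# Two hypothetical 64x64 RGB "images" as raw bytes.
flat = bytes([128]) * (64 * 64 * 3)            # uniform gray
rng = random.Random(0)
noisy = bytes(rng.randrange(256) for _ in range(64 * 64 * 3))  # pure noise

flat_reward = compressibility_reward(flat)
noisy_reward = compressibility_reward(noisy)
```

A uniform image compresses far better than noise, so `flat_reward` is larger than `noisy_reward`; any function with this shape (image in, scalar out) could serve as the reward in the same training loop.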

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The MDP framing could be tested on other iterative generative processes such as autoregressive sampling or score-based models in non-image domains.
Reward functions derived from safety classifiers could be plugged in to reduce generation of harmful content without retraining from scratch.
Variance-reduction techniques standard in RL might further stabilize DDPO when rewards are sparse or delayed across many denoising steps.

Load-bearing premise

The multi-step denoising process can be treated as a Markov decision process whose policy gradients remain stable and effective without prohibitive variance or credit assignment issues.
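The premise can be probed on a toy problem. The sketch below is not the paper's implementation: it assumes a one-dimensional "denoiser" whose only learnable parameter `theta` is a per-step mean shift, a terminal reward that prefers the final sample to land on `target`, and batch-normalized rewards as a simple baseline. All names and hyperparameters are hypothetical.

```python
import random
import statistics


def train_toy_ddpo(steps=5, sigma=0.5, target=2.0,
                   iters=300, batch=64, lr=0.01, seed=0):
    """Score-function policy gradient on a toy multi-step Gaussian chain."""
    rng = random.Random(seed)
    theta = 0.0  # per-step mean shift of the Gaussian "denoising" policy
    for _ in range(iters):
        rewards, scores = [], []
        for _ in range(batch):
            x = rng.gauss(0.0, 1.0)  # x_T: initial noise
            score = 0.0
            for _ in range(steps):
                nxt = rng.gauss(x + theta, sigma)  # action: sample x_{t-1}
                # d/dtheta of log N(nxt; x + theta, sigma^2)
                score += (nxt - x - theta) / sigma ** 2
                x = nxt
            rewards.append(-(x - target) ** 2)  # reward on the final sample only
            scores.append(score)
        # Normalize rewards within the batch (assumed variance-reduction choice).
        mean_r = statistics.fmean(rewards)
        std_r = statistics.pstdev(rewards) + 1e-8
        grad = statistics.fmean(
            [s * (r - mean_r) / std_r for s, r in zip(scores, rewards)]
        )
        theta += lr * grad  # gradient ascent on expected reward
    return theta


theta = train_toy_ddpo()
```

With these settings the expected final sample is `steps * theta`, so a successful run drives `theta` toward `target / steps`. Whether the same stability carries over to image-scale models with many more steps is exactly the premise in question.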

What would settle it

If DDPO produces no measurable improvement over reward-weighted likelihood training when optimizing a text-to-image model for compressibility on a fixed set of prompts and images, the claim of superior effectiveness would be falsified.

Original abstract

Diffusion models are a class of flexible generative models trained with an approximation to the log-likelihood objective. However, most use cases of diffusion models are not concerned with likelihoods, but instead with downstream objectives such as human-perceived image quality or drug effectiveness. In this paper, we investigate reinforcement learning methods for directly optimizing diffusion models for such objectives. We describe how posing denoising as a multi-step decision-making problem enables a class of policy gradient algorithms, which we refer to as denoising diffusion policy optimization (DDPO), that are more effective than alternative reward-weighted likelihood approaches. Empirically, DDPO is able to adapt text-to-image diffusion models to objectives that are difficult to express via prompting, such as image compressibility, and those derived from human feedback, such as aesthetic quality. Finally, we show that DDPO can improve prompt-image alignment using feedback from a vision-language model without the need for additional data collection or human annotation. The project's website can be found at http://rl-diffusion.github.io.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

DDPO applies policy gradients to diffusion denoising trajectories and gets gains on reward objectives like compressibility and aesthetics, though long-horizon variance is still an open question.

The main point is that this paper treats the multi-step denoising process in diffusion models as an MDP and applies policy gradient RL to optimize directly for downstream rewards instead of log-likelihood. They call the approach DDPO and show it can adapt text-to-image models to goals like image compressibility or aesthetic quality scores, plus improve prompt alignment using VLM feedback without new data collection. That framing is a clean way to handle objectives that are hard to express in prompts or likelihood terms.

The experiments appear to beat simple reward-weighted likelihood baselines on those tasks, which is the concrete advance here. The VLM-based alignment result is especially practical since it avoids extra human annotation.

The soft spot is the credit assignment problem over 50–1000 denoising steps with only terminal rewards. REINFORCE-style estimators can suffer high variance in such long trajectories, and the abstract does not spell out the variance reduction techniques or ablations that would confirm the gains are stable rather than seed-dependent. If the full paper shows consistent training curves and effective baselines, that concern shrinks; otherwise the method may need more engineering to be reliable.

This work sits at the RL-generative modeling intersection and would interest people fine-tuning diffusion models for custom metrics or preference alignment. It has enough substance and a clear empirical angle to merit peer review rather than a desk reject, even if revisions will likely focus on stability details and more controls.

Referee Report

2 major / 1 minor

Summary. The paper proposes denoising diffusion policy optimization (DDPO), which casts the multi-step reverse diffusion process as a Markov decision process and applies policy gradient methods to directly optimize diffusion models for non-likelihood objectives such as image compressibility and aesthetic quality derived from human feedback or vision-language models. It claims DDPO outperforms reward-weighted likelihood baselines and enables adaptation of text-to-image models without additional data collection.

Significance. If the results hold, this provides a practical route to fine-tune diffusion models for objectives that are hard to encode in prompts or likelihoods, with potential impact on alignment and downstream utility in generative modeling. The public website with code and examples is a strength for reproducibility.

major comments (2)

[Abstract] Abstract: the claim of empirical superiority over reward-weighted likelihood baselines is asserted without any quantitative results, controls, or ablation details, preventing assessment of effect sizes or statistical reliability.
[Method] The multi-step denoising MDP formulation (with terminal rewards only and trajectories of 50–1000 steps): the REINFORCE-style policy gradient estimator faces severe credit assignment and variance issues; the manuscript must show that variance remains controlled (e.g., via baselines, variance reduction techniques, or empirical variance plots) rather than relying on the assumption that gradients remain stable.

minor comments (1)

[None] The project website link is helpful; ensure all experimental details (hyperparameters, exact reward models, seed reporting) are also included in the main text or appendix for full reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below with clarifications from the manuscript and indicate revisions where they strengthen the presentation without misrepresenting the existing results.

Referee: [Abstract] Abstract: the claim of empirical superiority over reward-weighted likelihood baselines is asserted without any quantitative results, controls, or ablation details, preventing assessment of effect sizes or statistical reliability.

Authors: The abstract provides a high-level summary of the contribution. Quantitative comparisons to reward-weighted likelihood baselines, including effect sizes on compressibility and aesthetic quality metrics, controls across objectives, and results aggregated over multiple seeds, appear in Section 4 and the associated figures/tables. We will revise the abstract to include a concise quantitative highlight of the observed improvements to better support the claim at the summary level.
revision: yes

Referee: [Method] The multi-step denoising MDP formulation (with terminal rewards only and trajectories of 50–1000 steps): the REINFORCE-style policy gradient estimator faces severe credit assignment and variance issues; the manuscript must show that variance remains controlled (e.g., via baselines, variance reduction techniques, or empirical variance plots) rather than relying on the assumption that gradients remain stable.

Authors: We agree that long trajectories introduce credit-assignment and variance challenges for REINFORCE. The DDPO formulation in Section 3 incorporates a learned baseline for variance reduction, and the empirical results in Section 4 demonstrate reliable convergence across 50–1000 step trajectories on multiple tasks. We will expand the method section to explicitly describe the baseline and add a brief discussion (with supporting analysis) of observed gradient stability; if space permits, we will include variance-related plots in the appendix.
revision: partial

Circularity Check

0 steps flagged

No significant circularity; method introduces independent optimization procedure

The paper frames the denoising process as an MDP to enable policy gradient methods (DDPO) and compares them empirically to reward-weighted likelihood baselines. No derivation reduces by construction to fitted inputs, self-referential definitions, or load-bearing self-citations; the central claims rest on experimental adaptation to compressibility and aesthetic objectives rather than algebraic equivalence to prior parameters. The approach is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The central claim rests on the assumption that denoising steps form a valid MDP amenable to policy gradients and that the chosen reward functions (compressibility, aesthetics, VLM alignment) are well-defined and stable.

axioms (2)

domain assumption: Denoising diffusion can be cast as a multi-step Markov decision process.
Stated in the abstract as the enabling step for policy gradient algorithms.
domain assumption: Policy gradient methods can be applied directly to the denoising trajectory without prohibitive variance.
Implicit in the claim that DDPO is effective.

invented entities (1)

DDPO algorithm (no independent evidence)
purpose: Policy optimization for diffusion denoising steps
New named procedure introduced in the paper.

pith-pipeline@v0.9.0 · 5478 in / 1237 out tokens · 31448 ms · 2026-05-11T20:11:23.197353+00:00 · methodology

Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

OP-GRPO: Efficient Off-Policy GRPO for Flow-Matching Models
cs.CV 2026-04 unverdicted novelty 8.0

OP-GRPO is the first off-policy GRPO method for flow-matching models that reuses trajectories via replay buffer and importance sampling corrections, matching on-policy performance with 34.2% of the training steps.
Flow-GRPO: Training Flow Matching Models via Online RL
cs.CV 2025-05 unverdicted novelty 8.0

Flow-GRPO is the first online RL method for flow matching models, raising GenEval accuracy from 63% to 95% and text-rendering accuracy from 59% to 92% with little reward hacking.
GeoCycler: Reward-Aligned 3D Diffusion for Constraint-Conditioned Cyclic Peptide Design
cs.CE 2026-05 unverdicted novelty 7.0

GeoCycler aligns latent diffusion models via reward-weighted training with a type-gated stair reward to raise cyclic peptide closure rates across multiple topologies on the LNR benchmark.
Linear-DPO: Linear Direct Preference Optimization for Diffusion and Flow-Matching Generative Models
cs.CV 2026-05 unverdicted novelty 7.0

Linear-DPO replaces sigmoid utility with linear utility and adds EMA reference to improve preference alignment in diffusion and flow-matching text-to-image models.
AutoRubric-T2I: Robust Rule-Based Reward Model for Text-to-Image Alignment
cs.AI 2026-05 unverdicted novelty 7.0

AutoRubric-T2I learns a small set of interpretable rubrics for VLM judges that outperform scalar reward models on T2I benchmarks while using far less preference data.
AutoRubric-T2I: Robust Rule-Based Reward Model for Text-to-Image Alignment
cs.AI 2026-05 unverdicted novelty 7.0

AutoRubric-T2I learns and selects explicit rubrics from preference pairs to guide VLM judges, producing high-quality interpretable rewards for T2I alignment with far less data than traditional Bradley-Terry models.
Muninn: Your Trajectory Diffusion Model But Faster
cs.RO 2026-05 unverdicted novelty 7.0

Muninn accelerates diffusion trajectory planners up to 4.6x by spending an uncertainty budget to decide when to cache denoiser outputs, preserving performance and certifying bounded deviation from full computation.
Offline Preference Optimization for Rectified Flow with Noise-Tracked Pairs
cs.CV 2026-05 unverdicted novelty 7.0

PNAPO augments preference data with prior noise pairs and uses straight-line interpolation to create a tighter surrogate objective for offline alignment of rectified flow models.
SOPE: Stabilizing Off-Policy Evaluation for Online RL with Prior Data
cs.LG 2026-05 unverdicted novelty 7.0

SOPE uses an actor-aligned OPE signal on a held-out validation split to dynamically stop offline stabilization phases in online RL, improving performance up to 45.6% and cutting TFLOPs up to 22x on 25 Minari tasks.
MotionGRPO: Overcoming Low Intra-Group Diversity in GRPO-Based Egocentric Motion Recovery
cs.CV 2026-05 unverdicted novelty 7.0

MotionGRPO models diffusion sampling as a Markov decision process optimized with Group Relative Policy Optimization, using hybrid rewards and noise injection to boost sample diversity and local joint precision in egoc...
Stream-R1: Reliability-Perplexity Aware Reward Distillation for Streaming Video Generation
cs.CV 2026-05 unverdicted novelty 7.0

Stream-R1 improves distillation of autoregressive streaming video diffusion models by adaptively weighting supervision with a reward model at both rollout and per-pixel levels.
How to Guide Your Flow: Few-Step Alignment via Flow Map Reward Guidance
cs.LG 2026-04 unverdicted novelty 7.0

FMRG is a training-free, single-trajectory guidance method for flow models derived from optimal control that achieves strong reward alignment with only 3 NFEs.
Hallo-Live: Real-Time Streaming Joint Audio-Video Avatar Generation with Asynchronous Dual-Stream and Human-Centric Preference Distillation
cs.CV 2026-04 unverdicted novelty 7.0

Hallo-Live achieves 20.38 FPS real-time text-to-audio-video avatar generation with 0.94s latency using asynchronous dual-stream diffusion and HP-DMD preference distillation, matching teacher model quality at 16x highe...
HP-Edit: A Human-Preference Post-Training Framework for Image Editing
cs.CV 2026-04 unverdicted novelty 7.0

HP-Edit introduces a post-training framework and RealPref-50K dataset that uses a VLM-based HP-Scorer to align diffusion image editing models with human preferences, improving outputs on Qwen-Image-Edit-2509.
Learning to Credit the Right Steps: Objective-aware Process Optimization for Visual Generation
cs.CV 2026-04 unverdicted novelty 7.0

OTCA improves GRPO training for visual generation by estimating step importance in trajectories and adaptively weighting multiple reward objectives.
Generative Texture Filtering
cs.CV 2026-04 unverdicted novelty 7.0

A two-stage fine-tuning strategy on pre-trained generative models enables effective texture filtering that outperforms prior methods on challenging cases.
Guiding Distribution Matching Distillation with Gradient-Based Reinforcement Learning
cs.LG 2026-04 unverdicted novelty 7.0

GDMD replaces raw-sample rewards with distillation-gradient rewards in RL-guided diffusion distillation, yielding 4-step models that surpass their multi-step teachers on GenEval and human preference metrics.
UDM-GRPO: Stable and Efficient Group Relative Policy Optimization for Uniform Discrete Diffusion Models
cs.CV 2026-04 unverdicted novelty 7.0

UDM-GRPO is the first RL integration for uniform discrete diffusion models, using final clean samples as actions and forward-process trajectory reconstruction to raise GenEval accuracy from 69% to 96% and OCR accuracy...
LeapAlign: Post-Training Flow Matching Models at Any Generation Step by Building Two-Step Trajectories
cs.CV 2026-04 unverdicted novelty 7.0

LeapAlign fine-tunes flow matching models by constructing two consecutive leaps that skip multiple ODE steps with randomized timesteps and consistency weighting, enabling stable updates at any generation step.
Step-level Denoising-time Diffusion Alignment with Multiple Objectives
cs.LG 2026-04 unverdicted novelty 7.0

MSDDA derives a closed-form optimal reverse denoising distribution for multi-objective diffusion alignment that is exactly equivalent to step-level RL fine-tuning with no approximation error.
ScoRe-Flow: Complete Distributional Control via Score-Based Reinforcement Learning for Flow Matching
cs.RO 2026-04 unverdicted novelty 7.0

ScoRe-Flow achieves decoupled mean-variance control in stochastic flow matching by deriving a closed-form score for drift modulation plus learned variance, yielding faster RL convergence and higher success rates on lo...
VASR: Variance-Aware Systematic Resampling for Reward-Guided Diffusion
cs.AI 2026-04 unverdicted novelty 7.0

FVD applies Fleming-Viot population dynamics to diffusion model sampling at inference time to reduce diversity collapse while improving reward alignment and FID scores.
Discrete Flow Matching Policy Optimization
cs.LG 2026-04 unverdicted novelty 7.0

DoMinO reformulates discrete flow matching sampling as an MDP for unbiased RL fine-tuning with new TV regularizers, yielding better enhancer activity and naturalness on DNA design tasks.
Improving Text-to-Image Generation with Intrinsic Self-Confidence Rewards
cs.CV 2026-03 unverdicted novelty 7.0

SOLACE improves text-to-image generation by using intrinsic self-confidence rewards from noise reconstruction accuracy during reinforcement learning post-training without external supervision.
DiffusionNFT: Online Diffusion Reinforcement with Forward Process
cs.LG 2025-09 unverdicted novelty 7.0

DiffusionNFT performs online RL for diffusion models on the forward process via flow matching and positive-negative contrasts, delivering up to 25x efficiency gains and rapid benchmark improvements over prior reverse-...
MixGRPO: Unlocking Flow-based GRPO Efficiency with Mixed ODE-SDE
cs.AI 2025-07 unverdicted novelty 7.0

MixGRPO speeds up GRPO for flow-based image generators by restricting SDE sampling and optimization to a sliding window while using ODE elsewhere, cutting training time by up to 71% with better alignment performance.
SCRIPT: Scalable Diffusion Policy with Multi-stage Training for Language-driven Physics-based Humanoid Control
cs.GR 2026-05 unverdicted novelty 6.0

A new diffusion transformer policy with joint attention over actions, states, and text plus RL post-training outperforms prior methods on language alignment and motion quality for humanoid control.
Hierarchical Variational Policies for Reward-Guided Diffusion
cs.LG 2026-05 conditional novelty 6.0

A hierarchical variational formulation amortizes test-time guidance in diffusion models to achieve strong quality-speed tradeoffs with significantly reduced inference compute.
GeoFlow: Enforcing Implicit Geometric Consistency in Video Generation
cs.CV 2026-05 unverdicted novelty 6.0

GeoFlow adds a geometry-consistency reward based on rigid camera flow and object appearance preservation, integrated via reinforcement fine-tuning to improve geometric coherence in video generation.
Fine-tuning Pocket-Aware Diffusion Models via Denoising Policy Optimization
cs.LG 2026-05 unverdicted novelty 6.0

DEPPA reformulates the denoising process of pocket-aware diffusion models as a multi-step MDP and applies RL fine-tuning with a coarse scheduler to optimize ligands for binding affinity, drug-likeness, synthesizabilit...
Latent Action Control for Reasoning-Guided Unified Image Generation
cs.CV 2026-05 unverdicted novelty 6.0

Latent Action Control learns unobserved action trajectories via variational alignment and GRPO to inject reasoning into flow-based image generation, yielding gains on compositional benchmarks.
Do Less, Achieve More: Do We Need Every-Step Optimization for RL Fine-tuning of Diffusion Models?
cs.CV 2026-05 unverdicted novelty 6.0

AdaScope adaptively selects optimal RL intervention points during diffusion denoising by monitoring structural and semantic changes, delivering 66% higher performance at 59% lower cost than full-trajectory RL baselines.
Video Models Can Reason with Verifiable Rewards
cs.CV 2026-05 unverdicted novelty 6.0

VideoRLVR uses SDE-GRPO optimization, dense decomposed rewards, and Early-Step Focus to train video diffusion models on verifiable reasoning tasks, outperforming supervised fine-tuning and other video generators on Ma...
CRAFT: Clinical Reward-Aligned Finetuning for Medical Image Synthesis
cs.CV 2026-05 unverdicted novelty 6.0

CRAFT adapts diffusion models to medical images via clinical reward alignment from LLMs and VLMs, improving alignment scores and cutting low-quality generations by 20.4% on average across modalities.
Driving Intents Amplify Planning-Oriented Reinforcement Learning
cs.RO 2026-05 unverdicted novelty 6.0

DIAL uses intent-conditioned CFG and multi-intent GRPO to expand and preserve diverse modes in continuous-action preference RL, lifting RFS to 9.14 and surpassing both prior best (8.5) and human demonstration (8.13).
SyncDPO: Enhancing Temporal Synchronization in Video-Audio Joint Generation via Preference Learning
cs.CV 2026-05 unverdicted novelty 6.0

SyncDPO improves temporal synchronization in video-audio joint generation using DPO with efficient on-the-fly negative sample construction and curriculum learning.
When Policy Entropy Constraint Fails: Preserving Diversity in Flow-based RLHF via Perceptual Entropy
cs.CV 2026-05 unverdicted novelty 6.0

Policy entropy remains constant in flow-matching models during RLHF due to fixed noise schedules while perceptual diversity collapses from mode-seeking policy gradients, so perceptual entropy constraints are introduce...
dFlowGRPO: Rate-Aware Policy Optimization for Discrete Flow Models
cs.LG 2026-05 unverdicted novelty 6.0

dFlowGRPO is a new rate-aware RL method for discrete flow models that outperforms prior GRPO approaches on image generation and matches continuous flow models while supporting broad probability paths.
Auto-Rubric as Reward: From Implicit Preferences to Explicit Multimodal Generative Criteria
cs.AI 2026-05 unverdicted novelty 6.0

Auto-Rubric as Reward externalizes VLM preferences into structured rubrics and applies Rubric Policy Optimization to create more reliable binary rewards for multimodal generation, outperforming pairwise models on text...
From Synthetic to Real: Toward Identity-Consistent Makeup Transfer with Synthetic and Real Data
cs.CV 2026-05 unverdicted novelty 6.0

The work creates identity-consistent synthetic makeup data via ConsistentBeauty and adapts models to real images using reinforcement learning in RealBeauty, achieving better identity preservation and real-world perfor...
Flow-Direct: Feedback-Efficient and Reusable Guidance for Flow Models via Non-Parametric Guidance Field
cs.LG 2026-05 unverdicted novelty 6.0

Flow-Direct constructs a reusable non-parametric guidance field from the log-density ratio of base and target distributions using all accumulated reward samples for feedback-efficient guidance in flow models.
Response Time Enhances Alignment with Heterogeneous Preferences
cs.LG 2026-05 unverdicted novelty 6.0

Response times modeled as drift-diffusion processes enable consistent estimation of population-average preferences from heterogeneous anonymous binary choices.
SOPE: Stabilizing Off-Policy Evaluation for Online RL with Prior Data
cs.LG 2026-05 conditional novelty 6.0

SOPE dynamically controls offline training length in online RL using actor-aligned OPE on validation data to stop when benefits saturate, achieving up to 45.6% better performance and 22x less computation on Minari tasks.
Threshold-Guided Optimization for Visual Generative Models
cs.LG 2026-05 unverdicted novelty 6.0

A threshold-guided alignment method lets visual generative models be optimized directly from scalar human ratings instead of requiring paired preference data.
OGPO: Sample Efficient Full-Finetuning of Generative Control Policies
cs.LG 2026-05 unverdicted novelty 6.0

OGPO is a sample-efficient off-policy method for full finetuning of generative control policies that reaches SOTA on robotic manipulation tasks and can recover from poor behavior-cloning initializations without expert data.
ANO: A Principled Approach to Robust Policy Optimization
cs.AI 2026-05 unverdicted novelty 6.0

ANO derives a robust policy optimizer from geometric principles that replaces clipping with a smooth redescending gradient, showing better performance and stability than PPO, SPO, and GRPO in MuJoCo, Atari, and RLHF e...
How to Guide Your Flow: Few-Step Alignment via Flow Map Reward Guidance
cs.LG 2026-04 unverdicted novelty 6.0

FMRG is a training-free single-trajectory guidance framework for flow-based models that matches or exceeds baselines on reward-guided tasks and inverse problems using as few as 3 NFEs.
Learning from Noisy Preferences: A Semi-Supervised Learning Approach to Direct Preference Optimization
cs.CV 2026-04 unverdicted novelty 6.0

Semi-DPO applies semi-supervised learning to noisy preference data in diffusion DPO by training first on consensus pairs then iteratively pseudo-labeling conflicts, yielding state-of-the-art alignment with complex hum...
V-GRPO: Online Reinforcement Learning for Denoising Generative Models Is Easier than You Think
cs.LG 2026-04 unverdicted novelty 6.0

V-GRPO makes ELBO surrogates stable and efficient for online RL alignment of denoising models, delivering SOTA text-to-image performance with 2-3x speedups over MixGRPO and DiffusionNFT.
NaP-Control: Navigating Diffusion Prior for Versatile and Fast Character Control
cs.GR 2026-04 unverdicted novelty 6.0

NaP-Control uses RL to directly predict optimized diffusion noise from a task-agnostic prior, enabling fast inference and higher success rates for versatile whole-body character control while preserving motion quality.
Region-Constrained Group Relative Policy Optimization for Flow-Based Image Editing
cs.CV 2026-04 unverdicted novelty 6.0

RC-GRPO-Editing constrains GRPO exploration to editing regions via localized noise and attention rewards, improving instruction adherence and non-target preservation in flow-based image editing.
FP4 Explore, BF16 Train: Diffusion Reinforcement Learning via Efficient Rollout Scaling
cs.LG 2026-04 unverdicted novelty 6.0

Sol-RL decouples FP4-based candidate exploration from BF16 policy optimization in diffusion RL, delivering up to 4.64x faster convergence with maintained or superior alignment performance on models like FLUX.1 and SD3.5.
VASR: Variance-Aware Systematic Resampling for Reward-Guided Diffusion
cs.AI 2026-04 unverdicted novelty 6.0

VASR separates continuation and residual variance in reward-guided diffusion SMC, using optimal mass allocation and systematic resampling to achieve up to 26% better FID scores and faster runtimes than prior SMC and M...
CLEAR: Unlocking Generative Potential for Degraded Image Understanding in Unified Multimodal Models
cs.CV 2026-04 unverdicted novelty 6.0

CLEAR uses degradation-aware fine-tuning, a latent representation bridge, and interleaved reinforcement learning to connect generative and reasoning capabilities in multimodal models for better degraded image understanding.
CellFluxRL: Biologically-Constrained Virtual Cell Modeling via Reinforcement Learning
cs.LG 2026-03 unverdicted novelty 6.0

CellFluxRL post-trains the CellFlux generative model with reinforcement learning driven by biologically meaningful reward functions, yielding virtual cell images that better satisfy physical and biological constraints...
LucidNFT: LR-Anchored Multi-Reward Preference Optimization for Flow-Based Real-World Super-Resolution
cs.CV 2026-03 unverdicted novelty 6.0

LucidNFT combines a new LR-referenced consistency reward, decoupled normalization, and a real-degradation dataset to improve perceptual quality in flow-matching super-resolution while preserving input fidelity.
IdGlow: Dynamic Identity Modulation for Multi-Subject Generation
cs.CV 2026-02 unverdicted novelty 6.0

IdGlow is a progressive two-stage diffusion framework that uses task-adaptive timestep scheduling, temporal gating, VLM prompt synthesis, and group-level DPO to balance identity preservation and scene coherence in mul...
RL-RIG: A Generative Spatial Reasoner via Intrinsic Reflection
cs.CV 2026-02 unverdicted novelty 6.0

RL-RIG uses a generate-reflect-edit loop with reinforcement learning to improve spatial accuracy in image generation, reporting up to 11% gains over prior open-source models on scene-graph metrics.
Rethinking the Design Space of Reinforcement Learning for Diffusion Models: On the Importance of Likelihood Estimation Beyond Loss Design
cs.LG 2026-02 conditional novelty 6.0

An ELBO-based likelihood estimator from the final generated sample dominates other RL design factors for diffusion models, raising GenEval from 0.24 to 0.95 in 90 GPU hours with better efficiency than prior methods.
Image Diffusion Preview with Consistency Solver
cs.LG 2025-12 unverdicted novelty 6.0

ConsistencySolver enables high-quality low-step diffusion previews by adapting general linear multistep methods into a lightweight RL-optimized solver, matching multistep DPM-Solver FID with 47% fewer steps and cuttin...

Reference graph

Works this paper leans on

40 extracted references · 40 canonical work pages · cited by 89 Pith papers · 21 internal anchors

[1]

Is Conditional Generative Modeling all you need for Decision-Making?

Anurag Ajay, Yilun Du, Abhi Gupta, Joshua Tenenbaum, Tommi Jaakkola, and Pulkit Agrawal. Is conditional generative modeling all you need for decision-making? arXiv preprint arXiv:2211.15657,

work page internal anchor Pith review arXiv
[2]

Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback

Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, Nicholas Joseph, Saurav Kadavath, Jackson Kernion, Tom Conerly, Sheer El-Showk, Nelson Elhage, Zac Hatfield-Dodds, Danny Hernandez, Tristan Hume, Scott Johnston, Shauna Kravec, Liane Lovitt, Neel Nanda, Catherine Olsson, ...

work page internal anchor Pith review Pith/arXiv arXiv
[3]

Diffusion Policy: Visuomotor Policy Learning via Action Diffusion

Cheng Chi, Siyuan Feng, Yilun Du, Zhenjia Xu, Eric Cousineau, Benjamin Burchfiel, and Shuran Song. Diffusion Policy: Visuomotor Policy Learning via Action Diffusion. arXiv preprint arXiv:2303.04137,

work page internal anchor Pith review arXiv
[4]

Reduce, Reuse, Recycle: Compositional Generation with Energy-Based Diffusion Models and MCMC

Yilun Du, Conor Durkan, Robin Strudel, Joshua B Tenenbaum, Sander Dieleman, Rob Fergus, Jascha Sohl-Dickstein, Arnaud Doucet, and Will Grathwohl. Reduce, reuse, recycle: Compositional generation with energy-based diffusion models and mcmc. arXiv preprint arXiv:2302.11552,

work page arXiv
[5]

Optimizing DDPM Sampling with Shortcut Fine-Tuning

Ying Fan and Kangwook Lee. Optimizing ddpm sampling with shortcut fine-tuning. arXiv preprint arXiv:2301.13362,

work page arXiv
[6]

DPOK: Reinforcement Learning for Fine-tuning Text-to-Image Diffusion Models, November 2023

Ying Fan, Olivia Watkins, Yuqing Du, Hao Liu, Moonkyung Ryu, Craig Boutilier, Pieter Abbeel, Mohammad Ghavamzadeh, Kangwook Lee, and Kimin Lee. Dpok: Reinforcement learning for fine-tuning text-to-image diffusion models. arXiv preprint arXiv:2305.16381,

work page arXiv
[7]

An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion

Rinon Gal, Yuval Alaluf, Yuval Atzmon, Or Patashnik, Amit H Bermano, Gal Chechik, and Daniel Cohen-Or. An image is worth one word: Personalizing text-to-image generation using textual inversion. arXiv preprint arXiv:2208.01618,

work page internal anchor Pith review Pith/arXiv arXiv
[8]

Scaling Laws for Reward Model Overoptimization

Leo Gao, John Schulman, and Jacob Hilton. Scaling laws for reward model overoptimization. arXiv preprint arXiv:2210.10760,

work page internal anchor Pith review arXiv
[9]

IDQL: Implicit Q-Learning as an Actor-Critic Method with Diffusion Policies

Philippe Hansen-Estruch, Ilya Kostrikov, Michael Janner, Jakub Grudzien Kuba, and Sergey Levine. IDQL: Implicit q-learning as an actor-critic method with diffusion policies. arXiv preprint arXiv:2304.10573,

work page internal anchor Pith review arXiv 2021
[10]

Classifier-Free Diffusion Guidance

Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. In NeurIPS 2021 Workshop on Deep Generative Models and Downstream Applications,

work page 2021
[12]

LoRA: Low-Rank Adaptation of Large Language Models

Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685,

work page internal anchor Pith review Pith/arXiv arXiv
[13]

Aligning Text-to-Image Models using Human Feedback

Kimin Lee, Hao Liu, Moonkyung Ryu, Olivia Watkins, Yuqing Du, Craig Boutilier, Pieter Abbeel, Mohammad Ghavamzadeh, and Shixiang Shane Gu. Aligning text-to-image models using human feedback. arXiv preprint arXiv:2302.12192,

work page internal anchor Pith review arXiv
[14]

Compositional Visual Generation with Composable Diffusion Models

Nan Liu, Shuang Li, Yilun Du, Antonio Torralba, and Joshua B Tenenbaum. Compositional visual generation with composable diffusion models. arXiv preprint arXiv:2206.01714,

work page arXiv
[15]

Teaching language models to support answers with verified quotes

Jacob Menick, Maja Trebacz, Vladimir Mikulik, John Aslanides, Francis Song, Martin Chadwick, Mia Glaese, Susannah Young, Lucy Campbell-Gillingham, Geoffrey Irving, and Nat McAleese. Teaching language models to support answers with verified quotes. arXiv preprint arXiv:2203.11147,

work page internal anchor Pith review arXiv
[16]

AWAC: Accelerating Online Reinforcement Learning with Offline Datasets

Ashvin Nair, Murtaza Dalal, Abhishek Gupta, and Sergey Levine. Accelerating online reinforcement learning with offline datasets. arXiv preprint arXiv:2006.09359,

work page internal anchor Pith review arXiv 2006
[17]

WebGPT: Browser-assisted question-answering with human feedback

Reiichiro Nakano, Jacob Hilton, Suchir Balaji, Jeff Wu, Long Ouyang, Christina Kim, Christopher Hesse, Shantanu Jain, Vineet Kosaraju, William Saunders, Xu Jiang, Karl Cobbe, Tyna Eloundou, Gretchen Krueger, Kevin Button, Matthew Knight, Benjamin Chess, and John Schulman. Webgpt: Browser-assisted question-answering with human feedback. arXiv preprint arXi...

work page internal anchor Pith review Pith/arXiv arXiv
[18]

Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, and Ryan Lowe. Training language models to follow instructions with human feedback....

work page internal anchor Pith review Pith/arXiv arXiv
[20]

Advantage-Weighted Regression: Simple and Scalable Off-Policy Reinforcement Learning

URL https://arxiv.org/abs/1910.00177. Jan Peters and Stefan Schaal. Reinforcement learning by reward-weighted regression for operational space control. In International Conference on Machine learning,

work page internal anchor Pith review Pith/arXiv arXiv 1910
[21]

Learning Transferable Visual Models From Natural Language Supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. arXiv preprint arXiv:2103.00020,

work page internal anchor Pith review Pith/arXiv arXiv
[22]

Zero-Shot Text-to-Image Generation

Aditya Ramesh, Mikhail Pavlov, Scott Gray, Gabriel Goh, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. Zero-shot text-to-image generation. arXiv preprint arXiv:2102.12092,

work page internal anchor Pith review arXiv
[23]

DreamBooth: Fine Tuning Text-to-Image Diffusion Models for Subject-Driven Generation

Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman. Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. arXiv preprint arXiv:2208.12242,

work page arXiv
[24]

Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding

Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily Denton, Seyed Kamyar Seyed Ghasemipour, Burcu Karagol Ayan, S. Sara Mahdavi, Rapha Gontijo Lopes, Tim Salimans, Jonathan Ho, David J Fleet, and Mohammad Norouzi. Photorealistic text-to-image diffusion models with deep language understanding. arXiv preprint arXiv:2205.11487,

work page internal anchor Pith review arXiv
[25]

Structure-based Drug Design with Equivariant Diffusion Models

Arne Schneuing, Yuanqi Du, Arian Jamasb, Charles Harris, Ilia Igashov, Weitao Du, Tom Blundell, Pietro Lió, Carla Gomes, Michael Bronstein, Max Welling, and Bruno Correia. Structure-based drug design with equivariant diffusion models. arXiv preprint arXiv:2210.02303,

work page internal anchor Pith review Pith/arXiv arXiv
[26]

Proximal Policy Optimization Algorithms

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347,

work page internal anchor Pith review Pith/arXiv arXiv
[27]

Make-A-Video: Text-to-Video Generation without Text-Video Data

Uriel Singer, Adam Polyak, Thomas Hayes, Xi Yin, Jie An, Songyang Zhang, Qiyuan Hu, Harry Yang, Oron Ashual, Oran Gafni, et al. Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:2209.14792,

work page internal anchor Pith review Pith/arXiv arXiv
[28]

Diffusers: State-of-the-Art Diffusion Models

Patrick von Platen, Suraj Patil, Anton Lozhkov, Pedro Cuenca, Nathan Lambert, Kashif Rasul, Mishig Davaadorj, and Thomas Wolf. Diffusers: State-of-the-art diffusion models. https://github.com/huggingface/diffusers,

work page 1999
[29]

Diffusion Policies as an Expressive Policy Class for Offline Reinforcement Learning

Zhendong Wang, Jonathan J Hunt, and Mingyuan Zhou. Diffusion policies as an expressive policy class for offline reinforcement learning. arXiv preprint arXiv:2208.06193,

work page internal anchor Pith review arXiv
[30]

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush. Transformers: State-of-the-art...

work page 2020
[31]

URL https://www.aclweb.org/anthology/2020.emnlp-demos

Association for Computational Linguistics. URL https://www.aclweb.org/anthology/2020.emnlp-demos

work page 2020
[32]

ImageReward: Learning and Evaluating Human Preferences for Text-to-Image Generation

Jiazheng Xu, Xiao Liu, Yuchen Wu, Yuxuan Tong, Qinkai Li, Ming Ding, Jie Tang, and Yuxiao Dong. Imagereward: Learning and evaluating human preferences for text-to-image generation. arXiv preprint arXiv:2304.05977,

work page arXiv
[33]

LION: Latent Point Diffusion Models for 3D Shape Generation

Xiaohui Zeng, Arash Vahdat, Francis Williams, Zan Gojcic, Or Litany, Sanja Fidler, and Karsten Kreis. Lion: Latent point diffusion models for 3d shape generation. arXiv preprint arXiv:2210.06978,

work page arXiv
[34]

Fine-Tuning Language Models from Human Preferences

Daniel M. Ziegler, Nisan Stiennon, Jeffrey Wu, Tom B. Brown, Alec Radford, Dario Amodei, Paul Christiano, and Geoffrey Irving. Fine-tuning language models from human preferences. arXiv preprint arXiv:1909.08593,

work page internal anchor Pith review Pith/arXiv arXiv 1909
[35]

n animals

Appendix A: Overoptimization. Figure 7 (Reward model overoptimization; panel labels: Incompressibility, Counting Animals; row labels: DDPO, RWR): Examples of RL overoptimizing reward functions. (L) The diffusion model eventually loses all recognizable semantic content and produces noise when optimizing for incompressibility. (R) When optimized for prompts of the form “n anima...

work page 2021
[36]

n animals

When optimizing the incompressibility objective, the model eventually stops producing semantically meaningful content, degenerating into high-frequency noise. Similarly, we observed that LLaVA is susceptible to typographic attacks (Goh et al., 2021). When optimizing for alignment with respect to prompts of the form “n animals”, DDPO exploited deficiencie...

work page 2021
[37]

was originally introduced as a way to improve sample quality for conditional generation using the gradients from an image classifier. For a differentiable reward function such as the LAION aesthetics predictor (Schuhmann, 2022), one could naturally imagine an extension to classifier guidance that uses gradients from such a predictor to improve aesthetic s...

work page 2022
[38]

a green colored rabbit

We used the official implementation of universal guidance with the recommended hyperparameters for style transfer, substituting the guidance network with the LAION aesthetics predictor. While universal guidance is able to produce a statistically significant improvement in aesthetic score, the change is small compared to DDPO. We only report results aver...

work page 2023
[39]

a green colored rabbit

as the reward function. We evaluate the model using ImageReward and the LAION aesthetics predictor (Schuhmann, 2022). • Unlike DPOK, we do not employ KL regularization. [Figure residue: ImageReward score against reward queries (0 to 25k), with series labeled Color, Count, Composition, and Location, followed by a second panel against reward queries labeled LAION Aes...]

work page 2022
[40]

D.1 DDPO Implementation: We collect 256 samples per training iteration

as the base model and finetune only the UNet weights while keeping the text encoder and autoencoder weights frozen. D.1 DDPO Implementation. We collect 256 samples per training iteration. For DDPO_SF, we accumulate gradients across all 256 samples and perform one gradient update. For DDPO_IS, we split the samples into 4 minibatches and perform 4 gradient up...

work page 2017
[41]

For reinforcement learning, it does not make sense to train on the unconditional objective since the reward may depend on the context

and $\tilde{\epsilon}_\theta$ is the guided $\epsilon$-prediction that is used to compute the next denoised sample. For reinforcement learning, it does not make sense to train on the unconditional objective since the reward may depend on the context. However, we found that when only training on the conditional objective, performance rapidly deteriorated after the first round of finetun...

work page 1999
[42]

unseen animals

and is known to underperform other algorithms in more online settings (Duan et al., 2016). However, we can isolate the effect of the data distribution by varying how interleaved the sampling and training are in RWR. At one extreme is a single-round algorithm (Lee et al., 2023), in which N samples are collected from the pretrained model and used for finetu...

work page 2016
