Efficient Adjoint Matching for Fine-tuning Diffusion Models

Dongsoo Shin; Jaemoo Choi; Jaewoong Choi; Jeongwoo Shin; Joonseok Lee; Wei Guo; Yongxin Chen; Yuchen Zhu

arXiv: 2605.11480 · v2 · pith:DRRUJNA2 · submitted 2026-05-12 · 💻 cs.LG


Jeongwoo Shin, Dongsoo Shin, Yuchen Zhu, Wei Guo, Yongxin Chen, Joonseok Lee, Jaewoong Choi, Jaemoo Choi

Pith reviewed 2026-05-20 22:04 UTC · model grok-4.3

classification 💻 cs.LG

keywords: efficient adjoint matching, diffusion models, reward fine-tuning, stochastic optimal control, text-to-image generation, adjoint methods, training efficiency


The pith

Efficient Adjoint Matching reformulates the stochastic optimal control problem with a linear base drift and modified terminal cost to enable faster diffusion model fine-tuning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes Efficient Adjoint Matching to address the high computational cost of reward fine-tuning in diffusion models. Standard Adjoint Matching casts the task as a stochastic optimal control problem that demands full stochastic trajectory simulations and repeated backward adjoint ODE solves due to the complex base drift inherited from pretrained models. The authors replace this with a linear base drift plus an adjusted terminal cost, which permits few-step deterministic ODE sampling during training and supplies a closed-form adjoint expression that removes all backward simulation. On text-to-image benchmarks the method reaches comparable or better scores on PickScore, ImageReward, HPSv2.1, CLIPScore and Aesthetics while converging up to four times faster. A sympathetic reader would care because the change lowers the barrier to aligning large generative models with human preferences without sacrificing alignment quality.

Core claim

Reformulating the SOC problem with a linear base drift and a correspondingly modified terminal cost removes both sources of inefficiency in Adjoint Matching: it enables training-time sampling with a few-step deterministic ODE solver and yields a closed-form adjoint solution that eliminates backward adjoint simulation, while matching or surpassing prior performance on standard text-to-image reward fine-tuning benchmarks.

What carries the argument

Efficient Adjoint Matching (EAM) is the reformulation that swaps the pretrained model's non-trivial base drift for a linear one and adjusts the terminal cost to preserve the original objective, thereby permitting cheap deterministic sampling and an analytic adjoint.
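Below is a minimal NumPy sketch showing why an analytic adjoint removes the backward pass. It rests on two assumptions:

- Following the appendix excerpt listed as [43]–[45] in the reference graph, the linear base drift is $b(t, x) = D(t)\,x$ with $D(t) = (2Ct^2-1)/(t(2Ct^2-2t+1))$.
- The adjoint of Eq. (10) takes the AM-style form $\dot a(t) = -D(t)\,a(t)$ with $a(1) = \nabla g(X_1)$.

All function names are hypothetical. The brute-force Euler solve stands in for AM's backward adjoint simulation, and the closed form replaces it with a single multiplication.

```python
import numpy as np


def drift_coeff(t, C):
    """Assumed linear base-drift coefficient D(t), so that b(t, x) = D(t) * x."""
    return (2.0 * C * t**2 - 1.0) / (t * (2.0 * C * t**2 - 2.0 * t + 1.0))


def phi(t, C):
    """Closed-form multiplier: a(t; X_t) = phi(t) * a(1; X_1)."""
    return (2.0 * C - 1.0) * t / (2.0 * C * t**2 - 2.0 * t + 1.0)


def adjoint_closed_form(grad_g, t, C):
    """Adjoint at time t from the terminal gradient, with no backward simulation."""
    return phi(t, C) * np.asarray(grad_g, dtype=float)


def adjoint_backward_euler(grad_g, t_end, C, n_steps=20_000):
    """Brute-force reference: integrate da/dt = -D(t) a from t=1 down to t_end."""
    a = np.array(grad_g, dtype=float)
    ts = np.linspace(1.0, t_end, n_steps + 1)
    for t_cur, t_next in zip(ts[:-1], ts[1:]):
        dt = t_next - t_cur  # negative: we step backward in time
        a = a + dt * (-drift_coeff(t_cur, C) * a)
    return a


if __name__ == "__main__":
    C = 1.0  # any C > 1/2 keeps 2Ct^2 - 2t + 1 strictly positive
    grad_g = np.array([1.0, -2.0, 0.5])  # stand-in for the terminal-cost gradient
    for t in (0.9, 0.5, 0.3):
        exact = adjoint_closed_form(grad_g, t, C)
        simulated = adjoint_backward_euler(grad_g, t, C)
        print(t, exact, np.max(np.abs(exact - simulated)))
```

Since $\frac{d}{dt}\log\Phi_t = -D(t)$ and $\Phi_1 = 1$, the two routes agree up to Euler discretization error. The closed-form cost also does not grow with the number of time steps.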

If this is right

Training requires only a small number of deterministic function evaluations instead of full stochastic trajectories.
Backward adjoint simulation is replaced by a closed-form expression, cutting memory and compute per iteration.
Convergence occurs up to four times faster while scores on PickScore, ImageReward, HPSv2.1, CLIPScore and Aesthetics stay at or above prior levels.
The approach applies directly to existing pretrained diffusion models without changing their forward dynamics.
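A claim like "converges up to four times faster" can be read off training curves such as the PickScore-versus-GPU-hours plot in Figure 3. This page does not state the exact protocol. One plausible definition, assumed here, is the ratio of GPU hours each method needs to first reach a common target score. The sketch below interpolates linearly between logged checkpoints, and its helper names are hypothetical.

```python
from typing import Optional, Sequence


def time_to_target(hours: Sequence[float], scores: Sequence[float], target: float) -> Optional[float]:
    """First (linearly interpolated) time at which a logged score curve reaches `target`.

    Returns None if the curve never reaches the target.
    """
    if len(hours) != len(scores):
        raise ValueError("hours and scores must have the same length")
    if len(scores) > 0 and scores[0] >= target:
        return hours[0]
    for i in range(1, len(scores)):
        s_prev, s_cur = scores[i - 1], scores[i]
        if s_cur >= target:
            # s_prev < target here, so s_cur - s_prev > 0 and the division is safe.
            frac = (target - s_prev) / (s_cur - s_prev)
            return hours[i - 1] + frac * (hours[i] - hours[i - 1])
    return None


def speedup(
    hours_base: Sequence[float],
    scores_base: Sequence[float],
    hours_new: Sequence[float],
    scores_new: Sequence[float],
    target: float,
) -> Optional[float]:
    """Time-to-target of the baseline divided by that of the new method."""
    t_base = time_to_target(hours_base, scores_base, target)
    t_new = time_to_target(hours_new, scores_new, target)
    if t_base is None or t_new is None or t_new == 0:
        return None
    return t_base / t_new
```

Because the claim is "up to" 4×, the ratio presumably depends on the chosen target. Sweeping `target` over several reward levels shows how stable the speedup is.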

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same linear-drift trick could be tested on flow-matching or other continuous-time generative models that currently rely on adjoint-based fine-tuning.
If the closed-form adjoint remains stable at very low step counts, the method might support on-the-fly preference updates during interactive generation sessions.
The simplified dynamics open the possibility of deriving explicit convergence rates for reward alignment that were previously intractable under the full nonlinear drift.
Practitioners could combine EAM with parameter-efficient adapters to fine-tune only small subsets of a large diffusion model at even lower cost.

Load-bearing premise

The linear base drift, together with the modified terminal cost, must still solve the original reward-alignment objective, or at least yield comparable results on human-preference metrics.
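The block below sketches what the modified terminal cost has to achieve, using the standard SOC reduction for reward-tilted sampling that the appendix excerpt cites ([6, 13]). This is a hedged reading, not the paper's Eq. (4)–(5) or Proposition 3.2. It makes three assumptions:

- The optimal controlled process tilts the base terminal marginal by $e^{-g}$, as in AM's memoryless setting.
- The target is $p^{\mathrm{pre}}_1(x)\,e^{r(x)/\lambda}$, with a hypothetical reward scale $\lambda$.
- $p^{\mathrm{base}}_1 = \mathcal{N}(0,(2C-1)I)$, as stated in the Eq. (45) excerpt.

```latex
\begin{aligned}
p^{*}_1(x) &\propto p^{\mathrm{base}}_1(x)\, e^{-g(x)}
  \quad\text{and}\quad
  p^{*}_1(x) \propto p^{\mathrm{pre}}_1(x)\, e^{r(x)/\lambda} \\
\Rightarrow\; g(x) &= -\frac{r(x)}{\lambda} - \log p^{\mathrm{pre}}_1(x) + \log p^{\mathrm{base}}_1(x) + \text{const} \\
\Rightarrow\; \nabla g(x) &= -\frac{\nabla r(x)}{\lambda} - \nabla \log p^{\mathrm{pre}}_1(x) - \frac{x}{2C-1}
\end{aligned}
```

When the base dynamics are the pretrained model itself, as in AM, $p^{\mathrm{base}}_1 = p^{\mathrm{pre}}_1$ and the correction terms cancel, leaving $g = -r/\lambda$. Under this reading, the extra terms are the price of swapping in a linear base, and the premise above is that they can be handled without biasing the objective.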

What would settle it

A side-by-side run on the same text-to-image benchmarks in which EAM scores substantially lower than standard Adjoint Matching on PickScore or ImageReward would show the approximation fails to preserve alignment quality.

Figures

Figures reproduced from arXiv: 2605.11480 by Dongsoo Shin, Jaemoo Choi, Jaewoong Choi, Jeongwoo Shin, Joonseok Lee, Wei Guo, Yongxin Chen, Yuchen Zhu.

**Figure 1.** Comparison of Efficient Adjoint Matching (EAM) with Adjoint Matching (AM). (Left) AM relies on a stochastic SDE solver to construct each training trajectory and a sequential backward simulation to obtain the adjoint state along that trajectory. (Right) EAM eliminates both: intermediate states $X_t$ are obtained by first simulating the endpoint $X_1$ with a few-step ODE and then sampling $X_t$ from the original nois…

**Figure 2.** Qualitative comparison. See App. D for more examples.

…via redesigning the base dynamics. As shown in Algorithm 1, we simulate $X_1$ with the efficient ODE solver (17) and construct the intermediate state $X_t$ by sampling from the original noising kernel $q_t(\cdot \mid X_1)$. Then, we minimize the adjoint matching loss (13), which does not require adjoint ODE simulation. 4 Experiments. 4.1 Experimental Settings. Setup. We fine-tune Stable…

**Figure 3.** PickScore on DrawBench by training time (GPU hours). EAM converges significantly faster than AM (up to 4×).

Quantitative results. As shown in Tab. 2, EAM consistently matches or outperforms AM across most metrics in both the single-reward and multi-reward settings. Both EAM and AM substantially improve over the pretrained SD3.5-M baseline, while SD3.5-M with Classifier-Free Guidance (CFG) [17] attains the…

**Figure 4.** Qualitative Results. AM and EAM denote models fine-tuned using PickScore, while AM + Multi and EAM + Multi denote models fine-tuned using a combination of PickScore, HPSv2.1, and Aesthetics.

**Figure 5.** Qualitative Results. AM and EAM denote models fine-tuned using PickScore, while AM + Multi and EAM + Multi denote models fine-tuned using a combination of PickScore, HPSv2.1, and Aesthetics.

**Figure 6.** Qualitative Results. AM and EAM denote models fine-tuned using PickScore, while AM + Multi and EAM + Multi denote models fine-tuned using a combination of PickScore, HPSv2.1, and Aesthetics.
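The Figure 2 excerpt above summarizes Algorithm 1 in three steps:

1. Simulate $X_1$ with a few-step ODE solver.
2. Draw $X_t$ from the original noising kernel $q_t(\cdot \mid X_1)$.
3. Minimize the adjoint matching loss (13) without any adjoint ODE solve.

The NumPy sketch below illustrates only the target-construction step, under these assumptions:

- The kernel is the rectified-flow interpolation $\mathcal{N}(t X_1, (1-t)^2 I)$, with $X_0 \sim \mathcal{N}(0, I)$ and $t = 1$ at the data end.
- The adjoint multiplier is the $\Phi_t$ of Eq. (46) as it appears in the appendix excerpt.
- `am_style_loss`, together with its weight `sigma_t`, is a hypothetical AM-style stand-in for loss (13), whose exact form is not reproduced on this page.

```python
import numpy as np


def phi(t, C):
    """Closed-form adjoint multiplier, same assumed form as the appendix excerpt."""
    return (2.0 * C - 1.0) * t / (2.0 * C * t**2 - 2.0 * t + 1.0)


def build_targets(x1, grad_g, C, rng, t_min=0.05):
    """Construct (t, X_t, a_t) for a batch of endpoints X_1 without any backward pass.

    x1:     (B, d) endpoints, e.g. produced by a few-step ODE sampler.
    grad_g: (B, d) terminal-cost gradients evaluated at x1.
    Assumes q_t(. | X_1) = N(t * X_1, (1 - t)^2 I); t_min is a hypothetical cutoff.
    """
    batch = x1.shape[0]
    t = rng.uniform(t_min, 1.0, size=(batch, 1))  # one time per sample
    noise = rng.standard_normal(x1.shape)
    x_t = t * x1 + (1.0 - t) * noise              # X_t ~ q_t(. | X_1)
    a_t = phi(t, C) * grad_g                      # closed-form adjoint, no ODE solve
    return t, x_t, a_t


def am_style_loss(u_pred, sigma_t, a_t):
    """Hypothetical AM-style regression ||u + sigma(t) a||^2, averaged over the batch."""
    residual = u_pred + sigma_t * a_t
    return float(np.mean(np.sum(residual**2, axis=1)))
```

In a real training step, `u_pred` would come from the model being fine-tuned, evaluated at `(x_t, t)`. The point of the sketch is that every regression target is available in closed form once `x1` and its gradient are known.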

Original abstract

Reward fine-tuning has become a common approach for aligning pretrained diffusion and flow models with human preferences in text-to-image generation. Among reward-gradient-based methods, Adjoint Matching (AM) provides a principled formulation by casting reward fine-tuning as a stochastic optimal control (SOC) problem. However, AM incurs a substantial computational cost: it requires (i) stochastic simulation of full generative trajectories under memoryless dynamics, resulting in a large number of function evaluations, and (ii) backward ODE simulation of the adjoint state along each sampled trajectory. In this work, we observe that both bottlenecks are closely tied to the *non-trivial base drift* inherited from the pretrained model. Motivated by this observation, we propose **Efficient Adjoint Matching (EAM)**, which substantially improves training efficiency by reformulating the SOC problem with a *linear base drift* and a correspondingly modified *terminal cost*. This reformulation removes both sources of inefficiency: it enables training-time sampling with a few-step deterministic ODE solver and yields a closed-form adjoint solution that eliminates backward adjoint simulation. On standard text-to-image reward fine-tuning benchmarks, EAM converges up to 4× faster than AM and matches or surpasses it across various metrics, including PickScore, ImageReward, HPSv2.1, CLIPScore, and Aesthetics.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it: the pith above is the substance, and this section is the friction.

Desk Editor's Note private letter to a colleague

EAM swaps to a linear base drift plus adjusted terminal cost to drop both simulation and adjoint costs in reward fine-tuning, but the exact equivalence to the original SOC objective still needs checking.


The main point here is that the authors reformulate the adjoint matching setup by replacing the pretrained model's drift with a linear one and changing the terminal cost to compensate. This lets them replace stochastic trajectory sampling with a cheap deterministic few-step ODE and replace the backward adjoint pass with a closed form. The result is the reported 4x faster convergence on text-to-image reward benchmarks while staying at or above AM on PickScore, ImageReward, HPSv2.1, CLIPScore, and Aesthetics scores.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes Efficient Adjoint Matching (EAM) for reward fine-tuning of pretrained diffusion models. It reformulates the stochastic optimal control (SOC) problem underlying Adjoint Matching (AM) by replacing the pretrained model's non-trivial base drift with a linear base drift and adjusting the terminal cost accordingly. This change is claimed to enable few-step deterministic ODE sampling during training and a closed-form adjoint solution, eliminating backward adjoint simulation. Empirical results on text-to-image benchmarks report up to 4x faster convergence while matching or exceeding AM on PickScore, ImageReward, HPSv2.1, CLIPScore, and Aesthetics.

Significance. If the reformulation preserves the original reward-alignment objective, the work offers a practical route to scaling reward-gradient fine-tuning by removing the two dominant computational costs in AM. The approach is grounded in the SOC formulation and directly targets simulation and adjoint bottlenecks that limit current methods. Reproducible benchmarks and the explicit identification of the base-drift source of inefficiency are positive features.

major comments (2)

[§3.2] Reformulation of the SOC problem: The manuscript must explicitly derive or prove that the modified terminal cost exactly compensates for the switch to a linear base drift, so that the resulting value function and optimal policy coincide with those of the original AM problem. The abstract presents the change as removing both sources of inefficiency, yet the provided description does not contain a step-by-step verification that the two formulations are equivalent (or differ by a negligible bias) for the reward objective. This equivalence is load-bearing for the claim that EAM is a valid, faster drop-in replacement rather than an optimization of a different objective.
[§4.3] Experimental validation: The reported metric parity and 4× speedup are shown on standard benchmarks, but the paper should include an ablation that isolates the effect of the linear-drift approximation (e.g., comparing EAM against AM with the same number of function evaluations, or against a version that retains the original drift but uses the closed-form adjoint). Without such controls, it remains unclear whether the efficiency gain comes at the cost of solving a strictly easier problem.

minor comments (2)

[§3] Notation for the linear base drift and the modified terminal cost should be introduced with explicit definitions and contrasted with the original quantities in a single table or equation block for clarity.
[§3.3] The description of the few-step deterministic ODE solver used at training time would benefit from a short pseudocode block or reference to the exact integrator and step count.
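The minor comment above asks for pseudocode of the training-time solver. The paper's solver (17) is not shown on this page, and its integrator and step count are unknown here, so the following is only an illustration: a plain few-step Euler sampler in NumPy. It assumes $X_0 \sim \mathcal{N}(0, I)$ at $t = 0$ and the generated endpoint $X_1$ at $t = 1$, matching the appendix excerpt. `velocity` is a hypothetical callable standing in for the fine-tuned model.

```python
import numpy as np


def few_step_euler(velocity, x0, n_steps=4):
    """Deterministic Euler integration of dx/dt = velocity(x, t) from t=0 to t=1.

    velocity: callable (x, t) -> array with the same shape as x (hypothetical model).
    x0:       initial noise sample(s), X_0 ~ N(0, I).
    n_steps:  number of velocity evaluations (NFE), kept small at training time.
    """
    x = np.array(x0, dtype=float)
    ts = np.linspace(0.0, 1.0, n_steps + 1)
    for t_cur, t_next in zip(ts[:-1], ts[1:]):
        x = x + (t_next - t_cur) * velocity(x, t_cur)  # one function evaluation per step
    return x
```

Each loop iteration costs one function evaluation, so `n_steps` equals the training-time NFE. For an idealized straight-line velocity toward a single point, even two Euler steps land exactly on the endpoint. This is the usual intuition for why straight flows tolerate few-step sampling.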

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their detailed and constructive feedback on our manuscript. We appreciate the emphasis on ensuring the equivalence of the reformulated SOC problem and the need for additional experimental controls. We will revise the manuscript to address these points by adding an explicit derivation and ablation studies, which we believe will strengthen the presentation of Efficient Adjoint Matching.


Referee: [§3.2] Reformulation of the SOC problem: The manuscript must explicitly derive or prove that the modified terminal cost exactly compensates for the switch to a linear base drift, so that the resulting value function and optimal policy coincide with those of the original AM problem. The abstract presents the change as removing both sources of inefficiency, yet the provided description does not contain a step-by-step verification that the two formulations are equivalent (or differ by a negligible bias) for the reward objective. This equivalence is load-bearing for the claim that EAM is a valid, faster drop-in replacement rather than an optimization of a different objective.

Authors: We thank the referee for this important observation. Upon reflection, while the manuscript motivates the reformulation by noting that the non-trivial base drift causes the computational bottlenecks and adjusts the terminal cost to maintain the reward objective, we agree that a more explicit step-by-step derivation is necessary to rigorously show that the value function and optimal policy are identical to those in the original Adjoint Matching problem. In the revised manuscript, we will expand Section 3.2 to include a detailed proof demonstrating the exact compensation by the modified terminal cost, thereby confirming that EAM optimizes the same objective. This will be presented with mathematical derivations showing the equivalence of the two SOC formulations. revision: yes
Referee: [§4.3] Experimental validation: The reported metric parity and 4× speedup are shown on standard benchmarks, but the paper should include an ablation that isolates the effect of the linear-drift approximation (e.g., comparing EAM against AM with the same number of function evaluations, or against a version that retains the original drift but uses the closed-form adjoint). Without such controls, it remains unclear whether the efficiency gain comes at the cost of solving a strictly easier problem.

Authors: We acknowledge the value of isolating the impact of the linear base drift approximation through targeted ablations. The current results show that EAM achieves up to 4x faster convergence while matching or exceeding AM on multiple metrics, but additional controls would better attribute the gains. In the revised version, we will incorporate an ablation study that compares EAM and AM under equivalent computational constraints, such as using the same number of function evaluations during training. We will also discuss the feasibility of a hybrid approach that applies the closed-form adjoint to the original drift, though this may require further analysis as the closed-form solution is derived specifically from the linear drift assumption. These additions will help demonstrate that the efficiency improvements do not come from solving an easier problem but from the reformulation's ability to enable deterministic sampling and closed-form adjoints while preserving performance. revision: yes

Circularity Check

0 steps flagged

No significant circularity in the reformulation-based derivation

full rationale

The paper proposes EAM as a structural reformulation of the SOC problem using a linear base drift and correspondingly modified terminal cost, motivated by observed bottlenecks in the original AM method. This is presented as an explicit design choice that enables few-step ODE sampling and closed-form adjoint, rather than any quantity derived from fitted parameters, self-referential predictions, or load-bearing self-citations. No equations reduce to their inputs by construction, and performance claims are supported by external benchmark comparisons (PickScore, ImageReward, etc.) without statistical forcing from the same data. The derivation chain remains self-contained and independent of the patterns that would indicate circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central reformulation rests on introducing a linear base drift and a correspondingly modified terminal cost; no explicit free parameters, standard axioms, or new invented entities are stated in the abstract.

pith-pipeline@v0.9.0 · 5790 in / 1036 out tokens · 52265 ms · 2026-05-20T22:04:15.821384+00:00 · methodology



Reference graph

Works this paper leans on

46 extracted references · 46 canonical work pages · 3 internal anchors

[1] M. Albergo, N. M. Boffi, and E. Vanden-Eijnden. Stochastic interpolants: A unifying framework for flows and diffusions. Journal of Machine Learning Research, 2025.

[2] K. Black, M. Janner, Y. Du, I. Kostrikov, and S. Levine. Training diffusion models with reinforcement learning. In ICLR, 2024.

[3] D. Blessing, J. Berner, L. Richter, C. Domingo-Enrich, Y. Du, A. Vahdat, and G. Neumann. Trust region constrained measure transport in path space for stochastic optimal control and inference. In NeurIPS, 2025.

[4] J. Choi, Y. Zhu, W. Guo, P. Molodyk, B. Yuan, J. Bai, Y. Xin, M. Tao, and Y. Chen. Rethinking the design space of reinforcement learning for diffusion models: On the importance of likelihood estimation beyond loss design. In ICML, 2026.

[5] K. Clark, P. Vicol, K. Swersky, and D. J. Fleet. Directly fine-tuning diffusion models on differentiable rewards. In ICLR, 2024.

[6] C. Domingo-Enrich, M. Drozdzal, B. Karrer, and R. T. Q. Chen. Adjoint matching: Fine-tuning flow and diffusion generative models with memoryless stochastic optimal control. In ICLR, 2025.

[7] B. Efron. Tweedie's formula and selection bias. Journal of the American Statistical Association, 106(496):1602–1614, 2011.

[8] P. Esser, S. Kulal, A. Blattmann, R. Entezari, J. Müller, H. Saini, Y. Levi, D. Lorenz, A. Sauer, F. Boesel, D. Podell, T. Dockhorn, Z. English, and R. Rombach. Scaling rectified flow transformers for high-resolution image synthesis. In ICML, 2024.

[9] Y. Fan and K. Lee. Optimizing DDPM sampling with shortcut fine-tuning. In ICML, 2023.

[10] Y. Fan, O. Watkins, Y. Du, H. Liu, M. Ryu, C. Boutilier, P. Abbeel, M. Ghavamzadeh, K. Lee, and K. Lee. DPOK: Reinforcement learning for fine-tuning text-to-image diffusion models. In NeurIPS, 2023.

[11] W. Guo, J. Choi, Y. Zhu, M. Tao, and Y. Chen. Proximal diffusion neural sampler. In ICML, 2026.

[12] X. Guo, M. Cui, L. Bo, and D. Huang. ShortFT: Diffusion model alignment via shortcut-based fine-tuning. In ICCV, 2025.

[13] A. Havens, B. K. Miller, B. Yan, C. Domingo-Enrich, A. Sriram, B. Wood, D. Levine, B. Hu, B. Amos, B. Karrer, X. Fu, G.-H. Liu, and R. T. Q. Chen. Adjoint sampling: Highly scalable diffusion samplers via adjoint matching. In ICML, 2025.

[14] X. He, S. Fu, Y. Zhao, W. Li, J. Yang, D. Yin, F. Rao, and B. Zhang. TempFlow-GRPO: When timing matters for GRPO in flow models. In ICLR, 2026.

[15] J. Hessel, A. Holtzman, M. Forbes, R. L. Bras, and Y. Choi. CLIPScore: A reference-free evaluation metric for image captioning. In EMNLP, 2021.

[16] J. Ho, A. Jain, and P. Abbeel. Denoising diffusion probabilistic models. In NeurIPS, 2020.

[17] J. Ho and T. Salimans. Classifier-free diffusion guidance. arXiv:2207.12598, 2022.

[18] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, and W. Chen. LoRA: Low-rank adaptation of large language models. In ICLR, 2022.

[19] Y. Kirstain, A. Polyak, U. Singer, S. Matiana, J. Penna, and O. Levy. Pick-a-Pic: An open dataset of user preferences for text-to-image generation. In NeurIPS, 2023.

[20] Y. Lipman, R. T. Q. Chen, H. Ben-Hamu, M. Nickel, and M. Le. Flow matching for generative modeling. In ICLR, 2023.

[21] G.-H. Liu, J. Choi, Y. Chen, B. K. Miller, and R. T. Q. Chen. Adjoint Schrödinger bridge sampler. In NeurIPS, 2025.

[22] J. Liu, G. Liu, J. Liang, Y. Li, J. Liu, X. Wang, P. Wan, D. Zhang, and W. Ouyang. Flow-GRPO: Training flow matching models via online RL. In NeurIPS, 2025.

[23] X. Liu, C. Gong, and Q. Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow. In ICLR, 2023.

[24] I. Loshchilov and F. Hutter. Decoupled weight decay regularization. In ICLR, 2019.

[25] C. Lu, Y. Zhou, F. Bao, J. Chen, C. Li, and J. Zhu. DPM-Solver++: Fast solver for guided sampling of diffusion probabilistic models. Machine Intelligence Research, 2025.

[26] M. Prabhudesai, A. Goyal, D. Pathak, and K. Fragkiadaki. Aligning text-to-image diffusion models with reward backpropagation. arXiv:2310.03739, 2024.

[27] R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer. High-resolution image synthesis with latent diffusion models. In CVPR, 2022.

[28] C. Saharia, W. Chan, S. Saxena, L. Li, J. Whang, E. L. Denton, K. Ghasemipour, R. Gontijo Lopes, B. Karagol Ayan, T. Salimans, J. Ho, D. Fleet, and M. Norouzi. Photorealistic text-to-image diffusion models with deep language understanding. In NeurIPS, 2022.

[29] C. Schuhmann. LAION-AESTHETICS, Aug. 2022. https://laion.ai/blog/laion-aesthetics/. Accessed: 2026-04-30.

[30] Y. Shi, V. De Bortoli, A. Campbell, and A. Doucet. Diffusion Schrödinger bridge matching. In NeurIPS, 2023.

[31] J. Shin, J. Sul, J. Lee, J. Choi, and J. Choi. Efficient generative modeling beyond memoryless diffusion via adjoint Schrödinger bridge matching. In ICML, 2026.

[32] Y. Song, J. Sohl-Dickstein, D. P. Kingma, A. Kumar, S. Ermon, and B. Poole. Score-based generative modeling through stochastic differential equations. In ICLR, 2021.

[33] J. Wang, J. Liang, J. Liu, H. Liu, G. Liu, J. Zheng, W. Pang, A. Ma, Z. Xie, X. Wang, M. Wang, P. Wan, and X. Liang. GRPO-Guard: Mitigating implicit over-optimization in flow matching via regulated clipping. arXiv:2510.22319, 2025.

[34] Y. Wang, Z. Li, Y. Zang, Y. Zhou, J. Bu, C. Wang, Q. Lu, C. Jin, and J. Wang. Pref-GRPO: Pairwise preference reward-based GRPO for stable text-to-image reinforcement learning. arXiv:2508.20751, 2025.

[35] X. Wu, Y. Hao, M. Zhang, K. Sun, Z. Huang, G. Song, Y. Liu, and H. Li. Deep reward supervisions for tuning text-to-image diffusion models. In ECCV, 2024.

[36] X. Wu, K. Sun, F. Zhu, R. Zhao, and H. Li. Human preference score: Better aligning text-to-image models with human preference. In ICCV, 2023.

[37] J. Xu, X. Liu, Y. Wu, Y. Tong, Q. Li, M. Ding, J. Tang, and Y. Dong. ImageReward: Learning and evaluating human preferences for text-to-image generation. In NeurIPS, 2023.

[38] S. Xue, C. Ge, S. Zhang, Y. Li, and Z.-M. Ma. Advantage weighted matching: Aligning RL with pretraining in diffusion models. arXiv:2509.25050, 2025.

[39] Z. Xue, J. Wu, Y. Gao, F. Kong, L. Zhu, M. Chen, Z. Liu, W. Liu, Q. Guo, W. Huang, and P. Luo. DanceGRPO: Unleashing GRPO on visual generation. arXiv:2505.07818, 2025.

[40] H. Ye, K. Zheng, J. Xu, P. Li, H. Chen, J. Han, S. Liu, Q. Zhang, H. Mao, Z. Hao, P. Chattopadhyay, D. Yang, L. Feng, M. Liao, J. Bai, M.-Y. Liu, J. Zou, and S. Ermon. Data-regularized reinforcement learning for diffusion models at scale. arXiv:2512.04332, 2025.

[41] H. Zhao, H. Chen, J. Zhang, D. D. Yao, and W. Tang. Score as action: Fine-tuning diffusion generative models by continuous-time reinforcement learning. In ICML, 2025.

[42] K. Zheng, H. Chen, H. Ye, H. Wang, Q. Zhang, K. Jiang, H. Su, S. Ermon, J. Zhu, and M.-Y. Liu. DiffusionNFT: Online diffusion reinforcement with forward process. In ICLR, 2026.

Appendix A. Impact Statement. This work develops computational methods for reward-based fine-tuning of diffusion models. Our study is theoretical and computational in nature and us...
[43] Substituting this $I_t$ into Eq. (41) yields the announced family

$$D(t) = \frac{2Ct^2 - 1}{t\,(2Ct^2 - 2t + 1)}, \qquad C > \tfrac{1}{2}, \tag{44}$$

which is Eq. (15). Step 5: the second constraint is automatic. Using the explicit forms of $I_t$, $\Phi_t = (2C-1)t/(2Ct^2 - 2t + 1)$, and $J_t = I_1 - I_t \Phi_t^2$, a direct computation (using $\bar{\alpha}_t = \Phi_0 J_t/(\Phi_t I_1)$ and $\gamma_t^2 = 2 I_t J_t / I_1$) verifies that $\bar{\alpha}_t^2 + \gamma_t^2 =$ (...

[44] The second constraint in Eq. (39) therefore imposes no additional restriction on $D(t)$. Step 6: terminal distribution. At $t = 1$, $\Phi_0$ is the value of $\Phi$ at $t = 0$, which evaluates to $\Phi_0 = 0$ (since the numerator $(2C-1)t$ vanishes at $t = 0$). Hence $\bar{\alpha}_1 = 0$, and $X_1 = \bar{\beta}_1 X_1 + \gamma_1 \varepsilon$ marginalizes (using $X_0 \sim \mathcal{N}(0, I)$ and $\bar{\beta}_1 = 1$) to

$$p^{\mathrm{base}}_1 = \mathcal{N}\big(0, (2C-1)I\big), \tag{45}$$

since $I_1 =$...

[45] The family is therefore unique up to the single scalar $C$. Exact adjoint calculation. Plugging the linear base drift in Eq. (15) into Eq. (10) yields

$$a(t; X_t) = \frac{(2C-1)t}{2Ct^2 - 2t + 1}\, a(1; X_1), \qquad a(1; X_1) = \nabla g(X_1). \tag{46}$$

B.4 Proof of Proposition 3.2. We use the standard SOC reduction for reward-tilted sampling [6, 13]: under the SOC problem (4)–(5) with terminal co...

[46] ...in this section. CFG is known to improve alignment between the generated image and the text prompt, but this comes at the cost of doubling the NFEs, since the unconditional prediction must also be computed. As shown in Tab. 4, applying CFG improves all the metrics except Aesthetics. D Additional Qualitative Examples. [Figure panels: SD3.5, SD3.5 + CFG, AM, AM + Mu...]

[1] [1]

Albergo, N

M. Albergo, N. M. Boffi, and E. Vanden-Eijnden. Stochastic interpolants: A unifying framework for flows and diffusions.Journal of Machine Learning Research, 2025

work page 2025

[2] [2]

Black, M

K. Black, M. Janner, Y . Du, I. Kostrikov, and S. Levine. Training diffusion models with reinforcement learning. InICLR, 2024

work page 2024

[3] [3]

Blessing, J

D. Blessing, J. Berner, L. Richter, C. Domingo-Enrich, Y . Du, A. Vahdat, and G. Neumann. Trust region constrained measure transport in path space for stochastic optimal control and inference. InNeurIPS, 2025

work page 2025

[4] [4]

J. Choi, Y . Zhu, W. Guo, P. Molodyk, B. Yuan, J. Bai, Y . Xin, M. Tao, and Y . Chen. Rethinking the design space of reinforcement learning for diffusion models: On the importance of likelihood estimation beyond loss design. InICML, 2026

work page 2026

[5] [5]

Clark, P

K. Clark, P. Vicol, K. Swersky, and D. J. Fleet. Directly fine-tuning diffusion models on differentiable rewards. InICLR, 2024

work page 2024

[6] [6]

Domingo-Enrich, M

C. Domingo-Enrich, M. Drozdzal, B. Karrer, and R. T. Q. Chen. Adjoint matching: Fine-tuning flow and diffusion generative models with memoryless stochastic optimal control. InICLR, 2025

work page 2025

[7] [7]

B. Efron. Tweedie’s formula and selection bias.Journal of the American Statistical Association, 106(496):1602–1614, 2011

work page 2011

[8] [8]

Esser, S

P. Esser, S. Kulal, A. Blattmann, R. Entezari, J. Müller, H. Saini, Y . Levi, D. Lorenz, A. Sauer, F. Boesel, D. Podell, T. Dockhorn, Z. English, and R. Rombach. Scaling rectified flow transformers for high-resolution image synthesis. InICML, 2024

work page 2024

[9] [9]

Fan and K

Y . Fan and K. Lee. Optimizing DDPM sampling with shortcut fine-tuning. InICML, 2023

work page 2023

[10] [10]

Y . Fan, O. Watkins, Y . Du, H. Liu, M. Ryu, C. Boutilier, P. Abbeel, M. Ghavamzadeh, K. Lee, and K. Lee. DPOK: Reinforcement learning for fine-tuning text-to-image diffusion models. InNeurIPS, 2023

work page 2023

[11] [11]

W. Guo, J. Choi, Y . Zhu, M. Tao, and Y . Chen. Proximal diffusion neural sampler. InICML, 2026

work page 2026

[12] [12]

X. Guo, M. Cui, L. Bo, and D. Huang. ShortFT: Diffusion model alignment via shortcut-based fine-tuning. InICCV, 2025

work page 2025

[13] [13]

Havens, B

A. Havens, B. K. Miller, B. Yan, C. Domingo-Enrich, A. Sriram, B. Wood, D. Levine, B. Hu, B. Amos, B. Karrer, X. Fu, G.-H. Liu, and R. T. Q. Chen. Adjoint sampling: Highly scalable diffusion samplers via adjoint matching. InICML, 2025

work page 2025

[14] [14]

X. He, S. Fu, Y . Zhao, W. Li, J. Yang, D. Yin, F. Rao, and B. Zhang. TempFlow-GRPO: When timing matters for grpo in flow models. InICLR, 2026. 10

work page 2026

[15] [15]

Hessel, A

J. Hessel, A. Holtzman, M. Forbes, R. L. Bras, and Y . Choi. CLIPScore: A reference-free evaluation metric for image captioning. InEMNLP, 2021

work page 2021

[16] [16]

J. Ho, A. Jain, and P. Abbeel. Denoising diffusion probabilistic models. InNeurIPS, 2020

work page 2020

[17] [17]

Classifier-Free Diffusion Guidance

J. Ho and T. Salimans. Classifier-free diffusion guidance.arXiv:2207.12598, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022

[18] [18]

E. J. Hu, Y . Shen, P. Wallis, Z. Allen-Zhu, Y . Li, S. Wang, L. Wang, and W. Chen. LoRA: Low-rank adaptation of large language models. InICLR, 2022

work page 2022

[19] [19]

Kirstain, A

Y . Kirstain, A. Polyak, U. Singer, S. Matiana, J. Penna, and O. Levy. Pick-a-Pic: An open dataset of user preferences for text-to-image generation. InNeurIPS, 2023

work page 2023

[20] [20]

Lipman, R

Y . Lipman, R. T. Q. Chen, H. Ben-Hamu, M. Nickel, and M. Le. Flow matching for generative modeling. InICLR, 2023

work page 2023

[21] [21]

G.-H. Liu, J. Choi, Y . Chen, B. K. Miller, and R. T. Q. Chen. Adjoint schrödinger bridge sampler. In NeurIPS, 2025

work page 2025

[22] [22]

J. Liu, G. Liu, J. Liang, Y . Li, J. Liu, X. Wang, P. Wan, D. Zhang, and W. Ouyang. Flow-GRPO: Training flow matching models via online rl. InNeurIPS, 2025

work page 2025

[23] [23]

X. Liu, C. Gong, and Q. Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow. InICLR, 2023

work page 2023

[24] [24]

Loshchilov and F

I. Loshchilov and F. Hutter. Decoupled weight decay regularization. InICLR, 2019

work page 2019

[25] [25]

C. Lu, Y . Zhou, F. Bao, J. Chen, C. Li, and J. Zhu. DPM-Solver++: Fast solver for guided sampling of diffusion probabilistic models.Machine Intelligence Research, 2025

work page 2025

[26] [26]

arXiv:2310.03739, 2023

M. Prabhudesai, A. Goyal, D. Pathak, and K. Fragkiadaki. Aligning text-to-image diffusion models with reward backpropagation.arXiv:2310.03739, 2024

work page arXiv 2024

[27] [27]

Rombach, A

R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer. High-resolution image synthesis with latent diffusion models. InCVPR, 2022

[28]

C. Saharia, W. Chan, S. Saxena, L. Li, J. Whang, E. L. Denton, K. Ghasemipour, R. Gontijo Lopes, B. Karagol Ayan, T. Salimans, J. Ho, D. Fleet, and M. Norouzi. Photorealistic text-to-image diffusion models with deep language understanding. In NeurIPS, 2022

[29]

C. Schuhmann. LAION-AESTHETICS, Aug. 2022. https://laion.ai/blog/laion-aesthetics/. Accessed: 2026-04-30

[30]

Y. Shi, V. De Bortoli, A. Campbell, and A. Doucet. Diffusion Schrödinger bridge matching. In NeurIPS, 2023

[31]

J. Shin, J. Sul, J. Lee, J. Choi, and J. Choi. Efficient generative modeling beyond memoryless diffusion via adjoint Schrödinger bridge matching. In ICML, 2026

[32]

Y. Song, J. Sohl-Dickstein, D. P. Kingma, A. Kumar, S. Ermon, and B. Poole. Score-based generative modeling through stochastic differential equations. In ICLR, 2021

[33]

J. Wang, J. Liang, J. Liu, H. Liu, G. Liu, J. Zheng, W. Pang, A. Ma, Z. Xie, X. Wang, M. Wang, P. Wan, and X. Liang. GRPO-Guard: Mitigating implicit over-optimization in flow matching via regulated clipping. arXiv:2510.22319, 2025

[34]

Y. Wang, Z. Li, Y. Zang, Y. Zhou, J. Bu, C. Wang, Q. Lu, C. Jin, and J. Wang. Pref-GRPO: Pairwise preference reward-based GRPO for stable text-to-image reinforcement learning. arXiv:2508.20751, 2025

[35]

X. Wu, Y. Hao, M. Zhang, K. Sun, Z. Huang, G. Song, Y. Liu, and H. Li. Deep reward supervisions for tuning text-to-image diffusion models. In ECCV, 2024

[36]

X. Wu, K. Sun, F. Zhu, R. Zhao, and H. Li. Human preference score: Better aligning text-to-image models with human preference. In ICCV, 2023

[37]

J. Xu, X. Liu, Y. Wu, Y. Tong, Q. Li, M. Ding, J. Tang, and Y. Dong. ImageReward: Learning and evaluating human preferences for text-to-image generation. In NeurIPS, 2023

[38]

S. Xue, C. Ge, S. Zhang, Y. Li, and Z.-M. Ma. Advantage weighted matching: Aligning RL with pretraining in diffusion models. arXiv:2509.25050, 2025

[39]

Z. Xue, J. Wu, Y. Gao, F. Kong, L. Zhu, M. Chen, Z. Liu, W. Liu, Q. Guo, W. Huang, and P. Luo. DanceGRPO: Unleashing GRPO on visual generation. arXiv:2505.07818, 2025

[40]

H. Ye, K. Zheng, J. Xu, P. Li, H. Chen, J. Han, S. Liu, Q. Zhang, H. Mao, Z. Hao, P. Chattopadhyay, D. Yang, L. Feng, M. Liao, J. Bai, M.-Y. Liu, J. Zou, and S. Ermon. Data-regularized reinforcement learning for diffusion models at scale. arXiv:2512.04332, 2025

[41]

H. Zhao, H. Chen, J. Zhang, D. D. Yao, and W. Tang. Score as action: Fine-tuning diffusion generative models by continuous-time reinforcement learning. In ICML, 2025

[42]

K. Zheng, H. Chen, H. Ye, H. Wang, Q. Zhang, K. Jiang, H. Su, S. Ermon, J. Zhu, and M.-Y. Liu. DiffusionNFT: Online diffusion reinforcement with forward process. In ICLR, 2026

Appendix A Impact Statement

This work develops computational methods for reward-based fine-tuning of diffusion models. Our study is theoretical and computational in nature and us...


Substituting this $I_t$ into Eq. (41) yields the announced family

$$D(t) = \frac{2Ct^2 - 1}{t\,(2Ct^2 - 2t + 1)}, \qquad C > \tfrac{1}{2}, \tag{44}$$

which is Eq. (15). Step 5: the second constraint is automatic. Using the explicit forms of $I_t$, $\Phi_t = (2C-1)t/(2Ct^2 - 2t + 1)$, and $J_t = \frac{I_1 - I_t}{\Phi_t^2}$, a direct computation (using $\bar{\alpha}_t = \Phi_0 J_t/(\Phi_t I_1)$ and $\gamma_t^2 = 2 I_t J_t / I_1$) verifies that $\bar{\alpha}_t^2 + \gamma_t^2 =$ (...
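As a quick numerical check of the closed forms above, the NumPy sketch below confirms that the stated $\Phi_t$ satisfies $\frac{d}{dt}\log\Phi_t = -D(t)$ with $\Phi_1 = 1$ and $\Phi_0 = 0$, i.e. that it behaves as the integrating factor $\exp\big(\int_t^1 D(s)\,ds\big)$ of the linear drift in Eq. (44). Reading $\Phi_t$ as this integrating factor is our interpretation of the derivation rather than something stated in this excerpt; the helper name `check_integrating_factor` and the default $C = 2$ are illustrative choices (any $C > 1/2$ keeps the denominator $2Ct^2 - 2t + 1$ strictly positive).

```python
import numpy as np


def D(t, C):
    """Linear base-drift coefficient D(t) from Eq. (44)."""
    return (2 * C * t**2 - 1) / (t * (2 * C * t**2 - 2 * t + 1))


def Phi(t, C):
    """Closed-form Phi_t = (2C - 1) t / (2C t^2 - 2t + 1)."""
    return (2 * C - 1) * t / (2 * C * t**2 - 2 * t + 1)


def check_integrating_factor(C=2.0, h=1e-6):
    """Hypothetical helper: compare d/dt log Phi_t with -D(t) on interior times."""
    ts = np.linspace(0.05, 0.95, 19)
    # Central finite difference of log Phi_t; it should equal -D(t).
    dlog = (np.log(Phi(ts + h, C)) - np.log(Phi(ts - h, C))) / (2 * h)
    max_err = np.max(np.abs(dlog + D(ts, C)))
    # Boundary values: Phi_1 should be 1 and Phi_0 should be 0.
    return max_err, Phi(1.0, C), Phi(0.0, C)


if __name__ == "__main__":
    err, phi1, phi0 = check_integrating_factor()
    print(f"max |d/dt log Phi + D| = {err:.2e}, Phi_1 = {phi1}, Phi_0 = {phi0}")
```

The interior grid avoids $t = 0$, where $D(t)$ has a $-1/t$ singularity; that singularity is exactly what drives $\Phi_0$ to zero.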


The second constraint in Eq. (39) therefore imposes no additional restriction on $D(t)$. Step 6: terminal distribution. At $t = 1$, $\Phi_0$ is the value of $\Phi$ at $t = 0$, which evaluates to $\Phi_0 = 0$ (since the numerator $(2C-1)t$ vanishes at $t = 0$). Hence $\bar{\alpha}_1 = 0$, and $X_1 = \bar{\beta}_1 X_1 + \gamma_1 \varepsilon$ marginalizes (using $X_0 \sim \mathcal{N}(0, I)$ and $\bar{\beta}_1 = 1$) to

$$p^{\mathrm{base}}_1 = \mathcal{N}\big(0, (2C-1)I\big), \tag{45}$$

since $I_1 =$...


The family is therefore unique up to the single scalar $C$. Exact adjoint calculation. Plugging the linear base drift in Eq. (15) into Eq. (10) yields

$$a(t; X_t) = \frac{(2C-1)t}{2Ct^2 - 2t + 1}\, a(1; X_1), \qquad a(1; X_1) = \nabla g(X_1). \tag{46}$$

B.4 Proof of Proposition 3.2

We use the standard SOC reduction for reward-tilted sampling [6, 13]: under the SOC problem (4)–(5) with terminal co...
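The point of Eq. (46) is that no backward adjoint solve is needed. The sketch below illustrates this under an assumption: that for the linear drift $b(x,t) = D(t)\,x$ the adjoint dynamics in Eq. (10) reduce to $\frac{d}{dt}a(t) = -D(t)\,a(t)$ with $a(1) = \nabla g(X_1)$, as in the lean adjoint of Adjoint Matching. It integrates that ODE backward from $t = 1$ with fixed-step RK4 and compares the result with the closed form. The vector `grad_g` is a made-up stand-in for $\nabla g(X_1)$, and `adjoint_rk4` and `adjoint_closed_form` are hypothetical helper names.

```python
import numpy as np


def D(t, C):
    # Scalar coefficient of the linear base drift b(x, t) = D(t) x, Eq. (44).
    return (2 * C * t**2 - 1) / (t * (2 * C * t**2 - 2 * t + 1))


def adjoint_closed_form(t, a1, C):
    # Eq. (46): a(t) = (2C - 1) t / (2C t^2 - 2t + 1) * a(1).
    return (2 * C - 1) * t / (2 * C * t**2 - 2 * t + 1) * a1


def adjoint_rk4(t_end, a1, C, n_steps=2000):
    # Assumed adjoint ODE da/dt = -D(t) a, integrated backward from t = 1.
    def f(t, a):
        return -D(t, C) * a

    h = (t_end - 1.0) / n_steps  # negative step size: time runs backward
    t, a = 1.0, np.array(a1, dtype=float)
    for _ in range(n_steps):
        k1 = f(t, a)
        k2 = f(t + h / 2, a + h / 2 * k1)
        k3 = f(t + h / 2, a + h / 2 * k2)
        k4 = f(t + h, a + h * k3)
        a = a + h / 6 * (k1 + 2 * k2 + 2 * k3 + k4)
        t = t + h
    return a


if __name__ == "__main__":
    C = 2.0
    grad_g = np.array([0.3, -1.2, 0.7])  # hypothetical terminal gradient
    for t in (0.75, 0.5, 0.25):
        exact = adjoint_closed_form(t, grad_g, C)
        numeric = adjoint_rk4(t, grad_g, C)
        print(t, np.max(np.abs(exact - numeric)))
```

The closed form costs one scalar multiply per timestep, while the RK4 route needs four drift evaluations per step. In the original Adjoint Matching setting, with the pretrained network as base drift, each of those evaluations would require a vector-Jacobian product through the network, which is the cost the linear base drift removes.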


in this section. CFG is known to improve alignment between the generated image and the text prompt, but at the cost of doubling the NFEs, since the unconditional prediction must also be computed. As shown in Tab. 4, applying CFG improves all the metrics except Aesthetics.

D Additional Qualitative Examples

[Figure: qualitative comparison; panel labels: SD3.5, SD3.5 + CFG, AM, AM + Mu...]
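The NFE cost of CFG can be seen in a toy sampler. The sketch below is a hypothetical illustration, not the SD3.5 pipeline: it runs an Euler ODE sampler in which the guided velocity is $v_u + w\,(v_c - v_u)$, the usual classifier-free guidance combination [17] written for a velocity-predicting model. `CountingModel`, its toy velocity field, and the weight `w` are made-up placeholders; counting calls shows that guidance doubles the NFEs for the same number of steps.

```python
import numpy as np


class CountingModel:
    """Toy stand-in for a velocity network that counts function evaluations (NFEs)."""

    def __init__(self):
        self.nfe = 0

    def __call__(self, x, t, cond):
        self.nfe += 1
        # Hypothetical velocity field: pull x toward a condition-dependent target.
        target = np.zeros_like(x) if cond is None else cond
        return target - x


def euler_sample(model, x0, cond, n_steps=10, w=None):
    """Euler ODE sampler; if a guidance weight w is given, apply CFG at every step."""
    x, dt = x0.copy(), 1.0 / n_steps
    for i in range(n_steps):
        t = i * dt
        v_c = model(x, t, cond)  # conditional prediction (1 NFE)
        if w is not None:
            v_u = model(x, t, None)  # unconditional prediction (1 extra NFE)
            v = v_u + w * (v_c - v_u)
        else:
            v = v_c
        x = x + dt * v
    return x


if __name__ == "__main__":
    x0, cond = np.ones(4), np.full(4, 2.0)
    plain, guided = CountingModel(), CountingModel()
    euler_sample(plain, x0, cond, n_steps=10)
    euler_sample(guided, x0, cond, n_steps=10, w=4.5)
    print(plain.nfe, guided.nfe)
```

With `n_steps=10`, the unguided run makes 10 model calls and the guided run makes 20, matching the doubled NFE budget described above.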

work page