Reinforce Adjoint Matching: Scaling RL Post-Training of Diffusion and Flow-Matching Models
Pith reviewed 2026-05-12 04:33 UTC · model grok-4.3
The pith
KL-regularized reward maximization for diffusion and flow models leaves the noising law unchanged while tilting clean endpoints toward higher-reward samples, enabling a simple consistency loss for RL post-training.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
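The tilting statement in the pith has a standard closed form in KL-regularized control; as a sketch (notation assumed here, not taken from the paper), maximizing expected reward with a KL penalty of strength $\lambda$ against the pretrained process gives

```latex
% KL-regularized objective over generative processes p:
%   \max_p \; \mathbb{E}_{x_0 \sim p}[\, r(x_0) \,] - \lambda \, \mathrm{KL}(p \,\|\, p_{\mathrm{pre}})
% Optimizing the clean-endpoint marginal yields the exponential tilt
p^{\star}(x_0) \;=\; \frac{1}{Z}\, p_{\mathrm{pre}}(x_0)\, e^{\, r(x_0)/\lambda},
\qquad
Z \;=\; \mathbb{E}_{x_0 \sim p_{\mathrm{pre}}}\!\left[ e^{\, r(x_0)/\lambda} \right],
% while the conditional noising kernel is left unchanged:
q^{\star}(x_t \mid x_0) \;=\; q_{\mathrm{pre}}(x_t \mid x_0).
```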
Core claim
The central claim is that under KL-regularized reward maximization the optimal generative process tilts the clean-endpoint distribution toward higher-reward samples while leaving the noising law unchanged. This fact, together with the adjoint-matching optimality condition and a REINFORCE identity, directly yields Reinforce Adjoint Matching: a consistency loss that corrects the pretraining regression target with the reward. At each training step one simply samples a clean endpoint from the current model, evaluates its reward, applies the same analytical noising used in pretraining, and regresses against the adjusted target. No SDE rollouts, adjoint sweeps, or reward gradients are required.
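As a concrete sketch, the training step described above might look like the following. The model API (`sample`, `velocity`), the linear interpolant, and the self-normalized exponential-tilt weights are illustrative assumptions, not the paper's exact loss.

```python
import numpy as np

def ram_like_step(model, reward_fn, rng, batch_size, lam=0.1):
    """One rollout-free update in the spirit of RAM: sample clean endpoints,
    score them, re-noise analytically as in pretraining, and regress against
    a reward-adjusted target. Returns the scalar loss value."""
    # 1. Draw clean endpoints from the current model and evaluate rewards.
    x0 = model.sample(batch_size)                  # (B, D) clean samples
    r = reward_fn(x0)                              # (B,) scalar rewards
    # Stand-in reward correction: self-normalized exponential-tilt weights
    # (an assumption standing in for the paper's REINFORCE-derived term).
    w = np.exp((r - r.max()) / lam)
    w = w / w.sum() * batch_size
    # 2. Apply the same analytic noising as pretraining (linear interpolant).
    t = rng.uniform(size=(batch_size, 1))
    noise = rng.standard_normal(x0.shape)
    xt = (1 - t) * x0 + t * noise
    target = noise - x0                            # closed-form velocity target
    # 3. Reward-weighted regression; no gradient of the reward is ever taken.
    v = model.velocity(xt, t)
    per_sample = ((v - target) ** 2).mean(axis=1)
    return float((w * per_sample).mean())
```

A real implementation would backpropagate this loss into the model's parameters; the structural point is that steps 2 and 3 have the same shape as the pretraining regression.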
What carries the argument
Reinforce Adjoint Matching (RAM) consistency loss, which uses a REINFORCE identity on the adjoint-matching condition to adjust the pretraining target by the reward while preserving the fixed noising process.
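The REINFORCE identity referenced here is the standard score-function identity; writing it out makes clear why no reward gradients are needed:

```latex
\nabla_\theta \, \mathbb{E}_{x \sim p_\theta}\!\left[ f(x) \right]
  \;=\; \mathbb{E}_{x \sim p_\theta}\!\left[ f(x)\, \nabla_\theta \log p_\theta(x) \right].
% Only log-density gradients of the model appear; f enters as a scalar
% weight, so the reward need not be differentiable.
```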
Load-bearing premise
The assumption that KL-regularized reward maximization leaves the noising law unchanged and that the adjoint-matching optimality condition extends directly to the RL post-training setting.
What would settle it
Apply RAM to Stable Diffusion 3.5M or a comparable model for a fixed number of steps and compare final reward scores on composability, text rendering, and preference tasks against rollout-based baselines; if RAM fails to reach comparable rewards, or needs more steps than the reported 50× factor implies, the central claim does not hold.
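The efficiency criterion can be made precise with a small helper; `speedup_factor` is a hypothetical name, and the inputs are reward-vs-training-step curves of the kind described above.

```python
import numpy as np

def speedup_factor(ram_curve, baseline_curve):
    """How many times fewer training steps RAM needs to first reach the
    baseline's peak reward. Curves are 1-D arrays indexed by step (step 1
    at index 0). Returns None if RAM never reaches the baseline's peak."""
    peak = baseline_curve.max()
    baseline_steps = int(baseline_curve.argmax()) + 1  # first step at peak
    hits = np.nonzero(ram_curve >= peak)[0]            # steps where RAM matches it
    if hits.size == 0:
        return None                                    # the claim would not hold
    return baseline_steps / (int(hits[0]) + 1)
```

Under this reading, the paper's claim is that the factor reaches 50 on at least one reward; reporting it alongside seed variance would settle the comparison.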
Original abstract
Diffusion and flow-matching models scale because pretraining is supervised regression: a clean sample is noised analytically, and a model regresses against a closed-form target. RL post-training aligns the model with a reward. In image generation, this makes samples compose objects correctly, render text legibly, and match human preferences. Existing methods rely on costly SDE rollouts, reward gradients, or surrogate losses, sacrificing pretraining's regression structure. We show that the structure extends to RL post-training. Under KL-regularized reward maximization, the optimal generative process tilts the clean-endpoint distribution towards samples with higher reward and leaves the noising law unchanged. Combining this with the adjoint-matching optimality condition and a REINFORCE identity, we derive Reinforce Adjoint Matching (RAM): a consistency loss that corrects the pretraining target with the reward. At each step, we draw a clean endpoint from the current model, evaluate its reward, noise it as in pretraining, and regress. No SDE rollouts, backward adjoint sweeps, or reward gradients are required. Like the pretraining objective, RAM is simple and scales. On Stable Diffusion 3.5M, RAM achieves the highest reward on composability, text rendering, and human preference, reaching Flow-GRPO's peak reward in up to $50\times$ fewer training steps.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims to derive Reinforce Adjoint Matching (RAM), a method that extends the supervised regression structure of pretraining to RL post-training for diffusion and flow-matching models. Under KL-regularized reward maximization, the optimal generative process tilts the clean-endpoint distribution but leaves the noising law unchanged. By combining this with the adjoint-matching optimality condition and a REINFORCE identity, RAM is a consistency loss that corrects the pretraining target with the reward. This allows training by sampling clean endpoints from the current model, evaluating reward, noising, and regressing, without SDE rollouts, adjoint sweeps, or reward gradients. Experiments on Stable Diffusion 3.5M demonstrate superior performance on composability, text rendering, and human preference rewards, achieving Flow-GRPO's peak reward in up to 50 times fewer training steps.
Significance. If the derivation holds and the empirical results are reproducible, this is significant because it preserves the simple, scalable supervised-regression structure of pretraining when moving to RL alignment. This could enable efficient scaling of post-training for large generative models without introducing costly rollouts or gradients. The reported gains in reward and training efficiency on Stable Diffusion 3.5M suggest practical impact for alignment tasks like composability and text rendering.
Major comments (2)
- The central derivation (described in the abstract and likely detailed in the main theoretical section): the claim that the noising law remains identical to pretraining under KL-regularized reward maximization is load-bearing for RAM being an unbiased correction to the pretraining target. The manuscript must supply the full step-by-step derivation from the stated optimality conditions to this invariance, including any intermediate expectations, to confirm no hidden correction terms arise when extending the adjoint-matching condition to the RL setting.
- Experiments section (results on Stable Diffusion 3.5M): the 50× speedup claim over Flow-GRPO is a key empirical assertion. The paper should report the precise step counts at which each method reaches its reported peak reward, include reward-vs-steps curves with variance across seeds, and confirm that the comparison controls for batch size, learning rate, and total compute to substantiate the factor.
Simulated Author's Rebuttal
We thank the referee for their thoughtful review and constructive feedback on our manuscript. We believe the suggested revisions will strengthen the paper and are happy to incorporate them.
Point-by-point responses
- Referee: The central derivation (described in the abstract and likely detailed in the main theoretical section): the claim that the noising law remains identical to pretraining under KL-regularized reward maximization is load-bearing for RAM being an unbiased correction to the pretraining target. The manuscript must supply the full step-by-step derivation from the stated optimality conditions to this invariance, including any intermediate expectations, to confirm no hidden correction terms arise when extending the adjoint-matching condition to the RL setting.
Authors: We agree that a detailed, step-by-step derivation is essential to rigorously establish the invariance of the noising law. While the manuscript presents the key optimality conditions and the resulting invariance, we will expand this in the revision by providing a complete proof. This will start from the KL-regularized reward maximization objective, derive the tilted clean-endpoint distribution, show that the noising process remains unchanged, and integrate it with the adjoint-matching condition and REINFORCE identity to obtain the RAM loss. We will include all intermediate expectations and explicitly address why no additional correction terms appear. This expanded derivation will be added to the main text or as a dedicated appendix section. revision: yes
- Referee: Experiments section (results on Stable Diffusion 3.5M): the 50× speedup claim over Flow-GRPO is a key empirical assertion. The paper should report the precise step counts at which each method reaches its reported peak reward, include reward-vs-steps curves with variance across seeds, and confirm that the comparison controls for batch size, learning rate, and total compute to substantiate the factor.
Authors: We thank the referee for highlighting the need for more detailed experimental reporting to support the efficiency claims. In the revised manuscript, we will include precise step counts for when each method reaches its peak reward. We will add reward-versus-training-steps plots, including error bars or variance across multiple random seeds. Furthermore, we will explicitly document that the comparisons were performed under matched conditions, including identical batch sizes, learning rates, and equivalent total computational resources, to fairly substantiate the reported speedup factor. revision: yes
Circularity Check
No significant circularity: derivation chains from external optimality conditions
Full rationale
The paper's central derivation starts from the stated property of KL-regularized reward maximization (optimal process tilts clean-endpoint marginal while leaving noising law unchanged) and combines it with the adjoint-matching optimality condition plus a REINFORCE identity to obtain the RAM consistency loss. This structure is presented as extending the supervised regression of pretraining without any reduction of the target to a fitted parameter or self-referential definition. No load-bearing step reduces by construction to the paper's own inputs or to a self-citation chain; the REINFORCE step is framed as an identity applied to expectations over the model. The method remains a direct correction to the pretraining regression target, preserving the original supervised form.
Axiom & Free-Parameter Ledger
Axioms (2)
- Domain assumption: Under KL-regularized reward maximization, the optimal generative process tilts the clean-endpoint distribution towards samples with higher reward and leaves the noising law unchanged.
- Domain assumption: The adjoint-matching optimality condition applies to the RL-augmented generative process.