Reinforce Adjoint Matching: Scaling RL Post-Training of Diffusion and Flow-Matching Models
Pith reviewed 2026-05-12 04:33 UTC · model grok-4.3
The pith
KL-regularized reward maximization for diffusion and flow models leaves the noising law unchanged while tilting clean endpoints toward higher-reward samples, enabling a simple consistency loss for RL post-training.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
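The tilting statement in the pith has a standard closed form in KL-regularized control; as a sketch (notation assumed here, not taken from the paper), maximizing expected reward with a KL penalty of strength $\lambda$ against the pretrained process gives

```latex
% KL-regularized objective over generative processes p:
%   \max_p \; \mathbb{E}_{x_0 \sim p}[\, r(x_0) \,] - \lambda \, \mathrm{KL}(p \,\|\, p_{\mathrm{pre}})
% Optimizing the clean-endpoint marginal yields the exponential tilt
p^{\star}(x_0) \;=\; \frac{1}{Z}\, p_{\mathrm{pre}}(x_0)\, e^{\, r(x_0)/\lambda},
\qquad
Z \;=\; \mathbb{E}_{x_0 \sim p_{\mathrm{pre}}}\!\left[ e^{\, r(x_0)/\lambda} \right],
% while the conditional noising kernel is left unchanged:
q^{\star}(x_t \mid x_0) \;=\; q_{\mathrm{pre}}(x_t \mid x_0).
```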
Core claim
The central claim is that under KL-regularized reward maximization the optimal generative process tilts the clean-endpoint distribution toward higher-reward samples while leaving the noising law unchanged. This fact, together with the adjoint-matching optimality condition and a REINFORCE identity, directly yields Reinforce Adjoint Matching: a consistency loss that corrects the pretraining regression target with the reward. At each training step one simply samples a clean endpoint from the current model, evaluates its reward, applies the same analytical noising used in pretraining, and regresses against the adjusted target. No SDE rollouts, adjoint sweeps, or reward gradients are required.
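As a concrete sketch, the training step described above might look like the following. The model API (`sample`, `velocity`), the linear interpolant, and the self-normalized exponential-tilt weights are illustrative assumptions, not the paper's exact loss.

```python
import numpy as np

def ram_like_step(model, reward_fn, rng, batch_size, lam=0.1):
    """One rollout-free update in the spirit of RAM: sample clean endpoints,
    score them, re-noise analytically as in pretraining, and regress against
    a reward-adjusted target. Returns the scalar loss value."""
    # 1. Draw clean endpoints from the current model and evaluate rewards.
    x0 = model.sample(batch_size)                  # (B, D) clean samples
    r = reward_fn(x0)                              # (B,) scalar rewards
    # Stand-in reward correction: self-normalized exponential-tilt weights
    # (an assumption standing in for the paper's REINFORCE-derived term).
    w = np.exp((r - r.max()) / lam)
    w = w / w.sum() * batch_size
    # 2. Apply the same analytic noising as pretraining (linear interpolant).
    t = rng.uniform(size=(batch_size, 1))
    noise = rng.standard_normal(x0.shape)
    xt = (1 - t) * x0 + t * noise
    target = noise - x0                            # closed-form velocity target
    # 3. Reward-weighted regression; no gradient of the reward is ever taken.
    v = model.velocity(xt, t)
    per_sample = ((v - target) ** 2).mean(axis=1)
    return float((w * per_sample).mean())
```

A real implementation would backpropagate this loss into the model's parameters; the structural point is that steps 2 and 3 have the same shape as the pretraining regression.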
What carries the argument
Reinforce Adjoint Matching (RAM) consistency loss, which uses a REINFORCE identity on the adjoint-matching condition to adjust the pretraining target by the reward while preserving the fixed noising process.
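The REINFORCE identity referenced here is the standard score-function identity; writing it out makes clear why no reward gradients are needed:

```latex
\nabla_\theta \, \mathbb{E}_{x \sim p_\theta}\!\left[ f(x) \right]
  \;=\; \mathbb{E}_{x \sim p_\theta}\!\left[ f(x)\, \nabla_\theta \log p_\theta(x) \right].
% Only log-density gradients of the model appear; f enters as a scalar
% weight, so the reward need not be differentiable.
```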
Load-bearing premise
The assumption that KL-regularized reward maximization leaves the noising law unchanged and that the adjoint-matching optimality condition extends directly to the RL post-training setting.
What would settle it
Apply RAM to Stable Diffusion 3.5M or a comparable model for a fixed number of steps and compare final reward scores on composability, text rendering, and preference tasks against rollout-based baselines; if RAM fails to reach comparable rewards, or needs more steps than the reported 50× factor implies, the central claim does not hold.
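The efficiency criterion can be made precise with a small helper; `speedup_factor` is a hypothetical name, and the inputs are reward-vs-training-step curves of the kind described above.

```python
import numpy as np

def speedup_factor(ram_curve, baseline_curve):
    """How many times fewer training steps RAM needs to first reach the
    baseline's peak reward. Curves are 1-D arrays indexed by step (step 1
    at index 0). Returns None if RAM never reaches the baseline's peak."""
    peak = baseline_curve.max()
    baseline_steps = int(baseline_curve.argmax()) + 1  # first step at peak
    hits = np.nonzero(ram_curve >= peak)[0]            # steps where RAM matches it
    if hits.size == 0:
        return None                                    # the claim would not hold
    return baseline_steps / (int(hits[0]) + 1)
```

Under this reading, the paper's claim is that the factor reaches 50 on at least one reward; reporting it alongside seed variance would settle the comparison.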
Original abstract
Diffusion and flow-matching models scale because pretraining is supervised regression: a clean sample is noised analytically, and a model regresses against a closed-form target. RL post-training aligns the model with a reward. In image generation, this makes samples compose objects correctly, render text legibly, and match human preferences. Existing methods rely on costly SDE rollouts, reward gradients, or surrogate losses, sacrificing pretraining's regression structure. We show that the structure extends to RL post-training. Under KL-regularized reward maximization, the optimal generative process tilts the clean-endpoint distribution towards samples with higher reward and leaves the noising law unchanged. Combining this with the adjoint-matching optimality condition and a REINFORCE identity, we derive Reinforce Adjoint Matching (RAM): a consistency loss that corrects the pretraining target with the reward. At each step, we draw a clean endpoint from the current model, evaluate its reward, noise it as in pretraining, and regress. No SDE rollouts, backward adjoint sweeps, or reward gradients are required. Like the pretraining objective, RAM is simple and scales. On Stable Diffusion 3.5M, RAM achieves the highest reward on composability, text rendering, and human preference, reaching Flow-GRPO's peak reward in up to $50\times$ fewer training steps.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims to derive Reinforce Adjoint Matching (RAM), a method that extends the supervised regression structure of pretraining to RL post-training for diffusion and flow-matching models. Under KL-regularized reward maximization, the optimal generative process tilts the clean-endpoint distribution but leaves the noising law unchanged. By combining this with the adjoint-matching optimality condition and a REINFORCE identity, RAM is a consistency loss that corrects the pretraining target with the reward. This allows training by sampling clean endpoints from the current model, evaluating reward, noising, and regressing, without SDE rollouts, adjoint sweeps, or reward gradients. Experiments on Stable Diffusion 3.5M demonstrate superior performance on composability, text rendering, and human preference rewards, achieving Flow-GRPO's peak reward in up to 50 times fewer training steps.
Significance. If the derivation holds and the empirical results are reproducible, this is significant because it preserves the simple, scalable supervised-regression structure of pretraining when moving to RL alignment. This could enable efficient scaling of post-training for large generative models without introducing costly rollouts or gradients. The reported gains in reward and training efficiency on Stable Diffusion 3.5M suggest practical impact for alignment tasks like composability and text rendering.
Major comments (2)
- The central derivation (described in the abstract and likely detailed in the main theoretical section): the claim that the noising law remains identical to pretraining under KL-regularized reward maximization is load-bearing for RAM being an unbiased correction to the pretraining target. The manuscript must supply the full step-by-step derivation from the stated optimality conditions to this invariance, including any intermediate expectations, to confirm no hidden correction terms arise when extending the adjoint-matching condition to the RL setting.
- Experiments section (results on Stable Diffusion 3.5M): the 50× speedup claim over Flow-GRPO is a key empirical assertion. The paper should report the precise step counts at which each method reaches its reported peak reward, include reward-vs-steps curves with variance across seeds, and confirm that the comparison controls for batch size, learning rate, and total compute to substantiate the factor.
Simulated Author's Rebuttal
We thank the referee for their thoughtful review and constructive feedback on our manuscript. We believe the suggested revisions will strengthen the paper and are happy to incorporate them.
Point-by-point responses
- Referee: The central derivation (described in the abstract and likely detailed in the main theoretical section): the claim that the noising law remains identical to pretraining under KL-regularized reward maximization is load-bearing for RAM being an unbiased correction to the pretraining target. The manuscript must supply the full step-by-step derivation from the stated optimality conditions to this invariance, including any intermediate expectations, to confirm no hidden correction terms arise when extending the adjoint-matching condition to the RL setting.
Authors: We agree that a detailed, step-by-step derivation is essential to rigorously establish the invariance of the noising law. While the manuscript presents the key optimality conditions and the resulting invariance, we will expand this in the revision by providing a complete proof. This will start from the KL-regularized reward maximization objective, derive the tilted clean-endpoint distribution, show that the noising process remains unchanged, and integrate it with the adjoint-matching condition and REINFORCE identity to obtain the RAM loss. We will include all intermediate expectations and explicitly address why no additional correction terms appear. This expanded derivation will be added to the main text or as a dedicated appendix section. revision: yes
- Referee: Experiments section (results on Stable Diffusion 3.5M): the 50× speedup claim over Flow-GRPO is a key empirical assertion. The paper should report the precise step counts at which each method reaches its reported peak reward, include reward-vs-steps curves with variance across seeds, and confirm that the comparison controls for batch size, learning rate, and total compute to substantiate the factor.
Authors: We thank the referee for highlighting the need for more detailed experimental reporting to support the efficiency claims. In the revised manuscript, we will include precise step counts for when each method reaches its peak reward. We will add reward-versus-training-steps plots, including error bars or variance across multiple random seeds. Furthermore, we will explicitly document that the comparisons were performed under matched conditions, including identical batch sizes, learning rates, and equivalent total computational resources, to fairly substantiate the reported speedup factor. revision: yes
Circularity Check
No significant circularity: derivation chains from external optimality conditions
Full rationale
The paper's central derivation starts from the stated property of KL-regularized reward maximization (optimal process tilts clean-endpoint marginal while leaving noising law unchanged) and combines it with the adjoint-matching optimality condition plus a REINFORCE identity to obtain the RAM consistency loss. This structure is presented as extending the supervised regression of pretraining without any reduction of the target to a fitted parameter or self-referential definition. No load-bearing step reduces by construction to the paper's own inputs or to a self-citation chain; the REINFORCE step is framed as an identity applied to expectations over the model. The method remains a direct correction to the pretraining regression target, preserving the original supervised form.
Axiom & Free-Parameter Ledger
Axioms (2)
- Domain assumption: Under KL-regularized reward maximization, the optimal generative process tilts the clean-endpoint distribution towards samples with higher reward and leaves the noising law unchanged.
- Domain assumption: The adjoint-matching optimality condition applies to the RL-augmented generative process.