Reward Score Matching: Unifying Reward-based Fine-tuning for Flow and Diffusion Models
Pith reviewed 2026-05-10 05:30 UTC · model grok-4.3
The pith
Many reward-based fine-tuning methods for diffusion and flow models reduce to a single score-matching objective against a value-guided target.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Reward-based fine-tuning steers a pretrained diffusion or flow-based generative model toward higher-reward samples while remaining close to the pretrained model. Many existing methods can be written under the common framework of reward score matching, where alignment becomes score matching against a value-guided target. The main differences across methods reduce to the construction of the value-guidance estimator and the effective optimization strength across timesteps. This view clarifies the bias-variance-compute tradeoffs of existing designs and distinguishes core optimization components from auxiliary mechanisms.
What carries the argument
Reward score matching (RSM): the objective of matching the generative model's score to a value-guided target score, where the target incorporates reward information.
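In the paper's notation, as far as it can be recovered, Ψ⋆ₜ denotes the ideal reward-guided correction to the reference score (the value gradient scaled by 1/α), Ψ̂ₜ a practical estimate of it, and Ψₜ = γ(t)Ψ̂ₜ the effective guidance after per-timestep reweighting. A schematic objective consistent with this reading; the paper's exact loss may differ:

```latex
% Ideal value guidance: reward-informed correction to the reference score
\Psi^{\star}_{t} \;=\; \tfrac{1}{\alpha}\,\nabla_{x_t} V_t(x_t)

% Effective guidance after per-timestep reweighting
\Psi_{t} \;=\; \gamma(t)\,\hat{\Psi}_{t}

% Schematic reward-score-matching loss: match the fine-tuned score
% s_\theta to the value-guided target built from the reference score
\mathcal{L}_{\mathrm{RSM}}(\theta)
  \;=\; \mathbb{E}_{t,\,x_t}\!\left[\,
    \big\| s_{\theta}(x_t, t) - \big( s_{\mathrm{ref}}(x_t, t) + \Psi_{t} \big) \big\|^{2}
  \right]
```

Under this form, the claimed method differences live entirely in how Ψ̂ₜ is constructed and how γ(t) is scheduled.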
If this is right
- Existing methods' performance differences arise mainly from bias-variance-compute tradeoffs in estimator choice and timestep weighting.
- Auxiliary mechanisms that add complexity without altering the core score-matching objective can be removed without loss.
- Simpler redesigns become possible for both differentiable and black-box reward alignment tasks.
- The design space of reward-based fine-tuning shrinks to a smaller, more interpretable set of choices.
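The two-knob view above can be sketched as a generic loss in which the estimator and the timestep weighting are the only free choices. This is a minimal illustrative sketch, not the paper's implementation; all names and signatures are hypothetical, and the toy instantiation uses a standard-normal reference score with a linear reward.

```python
import numpy as np

def rsm_loss(score_model, score_ref, guidance_est, gamma, x_t, t):
    """Generic reward-score-matching loss (illustrative sketch).

    The two method-distinguishing knobs, per the RSM view:
      - guidance_est: value-guidance estimator, Psi_hat(x_t, t)
      - gamma:        per-timestep optimization strength, gamma(t)
    """
    target = score_ref(x_t, t) + gamma(t) * guidance_est(x_t, t)
    diff = score_model(x_t, t) - target
    return float(np.mean(np.sum(diff * diff, axis=-1)))

# Toy instantiation (hypothetical): Gaussian reference, linear reward V(x)=sum(x).
alpha = 0.5
score_ref = lambda x, t: -x                      # score of N(0, I)
guidance = lambda x, t: np.ones_like(x) / alpha  # (1/alpha) * grad V
gamma = lambda t: 1.0 - t                        # downweight late timesteps
score_model = lambda x, t: -x                    # untuned model == reference

x = np.zeros((4, 2))
loss = rsm_loss(score_model, score_ref, guidance, gamma, x, t=0.0)
# For this toy setup the target is 2.0 per coordinate, so loss == 8.0
```

Swapping in a different `guidance_est` (e.g., a Monte Carlo estimate for a black-box reward) or a different `gamma` schedule changes the method without changing the objective's form.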
Where Pith is reading between the lines
- The same unification lens could be applied to fine-tuning of other score-based or flow-based generative models not covered in the current experiments.
- Practitioners could select estimator type and timestep schedule based on whether their reward signal is noisy or expensive to evaluate.
- Direct optimization of the unified RSM objective might yield new reward functions that bypass intermediate value estimation steps.
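One way a practitioner might operationalize the second bullet is a small decision helper mapping reward-signal properties to the two RSM knobs. The mapping below is purely hypothetical and is not prescribed by the paper; it only encodes the standard bias-variance-compute intuitions the review attributes to the framework (differentiable rewards admit low-variance gradient estimators, black-box rewards force higher-variance score-function estimators, and noisy rewards favor averaging).

```python
def pick_rsm_design(differentiable: bool, noisy: bool, evals_per_step: int):
    """Hypothetical mapping from reward-signal properties to the two
    RSM design knobs (estimator type, timestep weighting).

    Illustrative only: the paper does not prescribe this table.
    """
    if differentiable:
        # Low-variance, but requires backprop through the reward.
        estimator = "reward-gradient"
    else:
        # Black-box compatible, but higher variance.
        estimator = "score-function (REINFORCE-style)"
    # Noisy or cheap-to-evaluate rewards favor averaging more samples.
    n_samples = max(1, evals_per_step)
    if noisy:
        n_samples *= 4  # trade compute for variance reduction
    gamma_schedule = "uniform" if noisy else "concentrated on informative timesteps"
    return {
        "estimator": estimator,
        "n_reward_samples": n_samples,
        "gamma_schedule": gamma_schedule,
    }

design = pick_rsm_design(differentiable=False, noisy=True, evals_per_step=2)
```

The point of the sketch is that, under the RSM view, such a choice is a configuration decision rather than a choice between nominally different algorithms.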
Load-bearing premise
The primary distinctions among existing methods reduce to the construction of the value-guidance estimator and the effective optimization strength across timesteps, without material loss of generality or overlooked auxiliary mechanisms.
What would settle it
Identification of a reward fine-tuning procedure whose update rule cannot be expressed as score matching to any value-guided target, or whose performance gains cannot be reproduced by varying only the estimator and timestep weighting within the RSM objective.
Original abstract
Reward-based fine-tuning steers a pretrained diffusion or flow-based generative model toward higher-reward samples while remaining close to the pretrained model. Although existing methods are derived from different perspectives, we show that many can be written under a common framework, which we call reward score matching (RSM). Under this view, alignment becomes score matching against a value-guided target, and the main differences across methods reduce to the construction of the value-guidance estimator and the effective optimization strength across timesteps. This unification clarifies the bias-variance-compute tradeoffs of existing designs, and distinguishes core optimization components from auxiliary mechanisms that add complexity without clear benefit. Guided by this perspective, we develop simpler, more efficient redesigns across representative differentiable and black-box reward alignment tasks. Overall, RSM turns a seemingly fragmented collection of reward-based fine-tuning methods into a smaller, more interpretable, and more actionable design space.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces Reward Score Matching (RSM) as a unifying framework for reward-based fine-tuning of pretrained diffusion and flow models. It claims that many existing methods, derived from different perspectives, can be rewritten as score matching against a value-guided target distribution, with primary differences reducing to the construction of the value-guidance estimator and the effective optimization strength (weighting) across timesteps. Guided by this view, the authors distinguish core optimization from auxiliary mechanisms and propose simpler, more efficient redesigns for both differentiable and black-box reward alignment tasks.
Significance. If the unification holds with the claimed lack of material loss of generality, the work provides a valuable organizing lens that clarifies bias-variance-compute tradeoffs and reduces the apparent fragmentation of reward fine-tuning methods to a smaller design space. This could make method development more interpretable and actionable. The contribution is primarily conceptual rather than algorithmic, with its practical strength resting on the reported redesigns; no machine-checked proofs or parameter-free derivations are claimed.
Major comments (2)
- [§3 (RSM framework)] The central unification claim (that existing methods can be rewritten under RSM without loss of the original behavior) is load-bearing but presented at a high level in the abstract; explicit derivations for representative methods (e.g., the value-guided target and weighting schedule for at least two standard baselines) must be shown in the main text to confirm preservation of objectives and rule out overlooked auxiliary mechanisms.
- [Experiments] The redesigns are asserted to be simpler and more efficient, but the experiments section must include direct quantitative comparisons (performance, compute, variance) against the original methods being unified; without these, the practical benefit of the RSM-guided simplifications remains unsubstantiated.
Minor comments (2)
- Notation for the value-guidance estimator and per-timestep weighting should be introduced with a single consistent definition early in the paper and used uniformly in all equations.
- [Abstract] The abstract states that auxiliary mechanisms 'add complexity without clear benefit'; this phrasing should be softened or supported by a brief reference to the specific ablations that demonstrate the lack of benefit.
Simulated Author's Rebuttal
We thank the referee for the constructive review and recommendation of minor revision. We address each major comment point by point below, agreeing where the suggestions strengthen the presentation and providing the requested additions in the revised manuscript.
Point-by-point responses
-
Referee: [§3 (RSM framework)] The central unification claim (that existing methods can be rewritten under RSM without loss of the original behavior) is load-bearing but presented at a high level in the abstract; explicit derivations for representative methods (e.g., the value-guided target and weighting schedule for at least two standard baselines) must be shown in the main text to confirm preservation of objectives and rule out overlooked auxiliary mechanisms.
Authors: We agree that explicit derivations will make the unification claim more rigorous and verifiable. In the revised §3, we have added a dedicated subsection with full step-by-step derivations for two representative baselines: one differentiable reward method (e.g., diffusion DPO) and one black-box method (e.g., DDPO). For each, we explicitly derive the value-guided target distribution and the corresponding timestep weighting schedule, showing that the original objective is recovered exactly as score matching under RSM with no additional auxiliary mechanisms required. These derivations confirm preservation of behavior and clarify how differences reduce to estimator construction and weighting. revision: yes
-
Referee: [Experiments] The redesigns are asserted to be simpler and more efficient, but the experiments section must include direct quantitative comparisons (performance, compute, variance) against the original methods being unified; without these, the practical benefit of the RSM-guided simplifications remains unsubstantiated.
Authors: We acknowledge that direct comparisons are necessary to substantiate the practical advantages of the RSM redesigns. The revised experiments section now includes head-to-head quantitative evaluations on both differentiable and black-box tasks. We report reward alignment performance, wall-clock training time, memory usage, and empirical variance (across seeds) for the RSM-based methods versus the original baselines. The results demonstrate that the simplifications achieve comparable or superior performance with lower compute and variance, validating the efficiency claims. revision: yes
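The "empirical variance (across seeds)" measurement the rebuttal promises can be sketched as: run the same training configuration under several seeds and report the mean and sample variance of the final reward. The helper and the deterministic stand-in `run_fn` below are hypothetical, not from the paper:

```python
import statistics

def seed_variance(run_fn, seeds):
    """Across-seed variance measurement (illustrative sketch).

    run_fn(seed) stands in for a full fine-tuning run returning the
    final reward; here we only fix the bookkeeping, not the training.
    """
    scores = [run_fn(seed) for seed in seeds]
    return statistics.mean(scores), statistics.variance(scores)

# Toy stand-in: a deterministic "final reward" per seed.
mean, var = seed_variance(lambda s: 0.8 + 0.01 * (s % 3), seeds=range(6))
```

Reporting this alongside wall-clock time and memory gives the head-to-head comparison the referee asked for in a single table per task.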
Circularity Check
No significant circularity; unification is an independent re-expression
Full rationale
The paper algebraically rewrites existing reward-based fine-tuning objectives for diffusion and flow models as score matching against a value-guided target, with method differences isolated to the value estimator construction and per-timestep weighting. This re-expression does not reduce any core claim to a fitted input renamed as prediction, a self-citation chain, or a definitional loop; the derivations remain self-contained against the cited prior methods and do not invoke uniqueness theorems or ansatzes from the authors' own prior work. The framework functions as an organizing view that clarifies tradeoffs without forcing results by construction.
Forward citations
Cited by 1 Pith paper
-
Improved techniques for fine-tuning flow models via adjoint matching: a deterministic control pipeline
A new adjoint matching framework formulates flow model alignment as optimal control, enabling direct regression training and terminal-trajectory truncation for efficiency gains on models like SiT-XL and FLUX.