V-GRPO: Online Reinforcement Learning for Denoising Generative Models Is Easier than You Think
Pith reviewed 2026-05-08 08:23 UTC · model grok-4.3
The pith
By reducing variance in the ELBO surrogate and controlling gradient steps, online reinforcement learning for denoising generative models becomes stable, efficient, and superior to MDP-based alternatives.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
V-GRPO shows that an ELBO-based surrogate for policy gradients, once stabilized through variance reduction and step-size control, can be integrated with GRPO to produce stable online RL updates for denoising generative models; this yields state-of-the-art text-to-image results together with a 2× speedup over MixGRPO and a 3× speedup over DiffusionNFT.
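To pin down the shape of the claim, the following is a sketch of the general form such an objective takes, assuming a standard noise-prediction diffusion parameterization with timestep weighting w(t); it illustrates the approach the abstract describes, not the paper's exact equations.

```latex
% The intractable log-likelihood is replaced by the diffusion ELBO (a lower bound):
\log p_\theta(x) \;\ge\; \mathrm{ELBO}_\theta(x)
  \;=\; -\,\mathbb{E}_{t,\epsilon}\!\left[ w(t)\,\lVert \epsilon_\theta(x_t, t) - \epsilon \rVert^2 \right] + C.

% GRPO's group-relative advantage over a group of G samples x_1,\dots,x_G with rewards r(x_i):
\hat{A}_i \;=\; \frac{r(x_i) - \operatorname{mean}_j r(x_j)}{\operatorname{std}_j r(x_j)}.

% ELBO-surrogate policy gradient: weight each sample's likelihood gradient by its advantage,
% with the ELBO standing in for the intractable log-likelihood:
\nabla_\theta J(\theta) \;\approx\; \frac{1}{G}\sum_{i=1}^{G} \hat{A}_i \,\nabla_\theta\, \mathrm{ELBO}_\theta(x_i).
```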
What carries the argument
V-GRPO, the algorithm that pairs reduced-variance ELBO likelihood surrogates with Group Relative Policy Optimization and explicit gradient control to enable direct, stable policy updates on denoising models.
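As a rough illustration of how these pieces fit together in code, here is a minimal, self-contained sketch of one advantage-weighted ELBO update in PyTorch. The sampler, negative-ELBO estimator, and reward model are passed in as callables because the paper's implementations are not reproduced here; treat this as a sketch of the recipe, not the authors' code.

```python
import torch

def vgrpo_step(model, optimizer, prompts, sample_fn, neg_elbo_fn, reward_fn,
               group_size=24, max_grad_norm=1.0):
    """One advantage-weighted ELBO update (illustrative sketch).

    sample_fn(model, prompts, n) -> images, shape (P, n, ...)
    neg_elbo_fn(model, images)   -> per-sample negative ELBO, shape (P, n)
    reward_fn(images, prompts)   -> per-sample rewards, shape (P, n)
    """
    with torch.no_grad():
        # Roll out a group of samples per prompt with the current policy.
        images = sample_fn(model, prompts, group_size)
        rewards = reward_fn(images, prompts)
        # Group-relative advantages, as in GRPO: normalize within each group.
        adv = (rewards - rewards.mean(dim=1, keepdim=True)) \
              / (rewards.std(dim=1, keepdim=True) + 1e-8)

    # The negative ELBO is a Monte-Carlo surrogate for -log p_theta(image);
    # weighting it by the advantage raises likelihood on above-average samples
    # and lowers it on below-average ones.
    loss = (adv * neg_elbo_fn(model, images)).mean()

    optimizer.zero_grad()
    loss.backward()
    # Gradient-step control: clip the global norm so a noisy surrogate
    # estimate cannot trigger a destabilizing update.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)
    optimizer.step()
    return loss.item()
```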
Load-bearing premise
The proposed variance reduction and gradient control steps will keep the ELBO surrogate stable across different model architectures and reward signals without introducing fresh instabilities or demanding heavy per-task tuning.
What would settle it
Apply V-GRPO to a standard text-to-image diffusion model and compare reward scores and wall-clock training time against MixGRPO and DiffusionNFT on the same benchmark; if the method fails to match the reported performance gains or speedups, the central claim does not hold.
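The test is mechanical enough to sketch. Assuming each method exposes a single training-step callable and all share one reward benchmark, a matched wall-clock comparison could look like the following; every name here is a hypothetical stand-in.

```python
import time

def compare_under_matched_budget(methods, eval_reward, budget_seconds):
    """Run each method for the same wall-clock budget, then score it.

    methods: dict mapping method name -> zero-argument training-step callable
    eval_reward: callable mapping method name -> benchmark reward score
    """
    results = {}
    for name, train_step in methods.items():
        start, steps = time.time(), 0
        while time.time() - start < budget_seconds:  # matched compute budget
            train_step()
            steps += 1
        results[name] = {"train_steps": steps, "reward": eval_reward(name)}
    return results
```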
Original abstract
Aligning denoising generative models with human preferences or verifiable rewards remains a key challenge. While policy-gradient online reinforcement learning (RL) offers a principled post-training framework, its direct application is hindered by the intractable likelihoods of these models. Prior work therefore either optimizes an induced Markov decision process (MDP) over sampling trajectories, which is stable but inefficient, or uses likelihood surrogates based on the diffusion evidence lower bound (ELBO), which have so far underperformed on visual generation. Our key insight is that the ELBO-based approach can, in fact, be made both stable and efficient. By reducing surrogate variance and controlling gradient steps, we show that this approach can beat MDP-based methods. To this end, we introduce Variational GRPO (V-GRPO), a method that integrates ELBO-based surrogates with the Group Relative Policy Optimization (GRPO) algorithm, alongside a set of simple yet essential techniques. Our method is easy to implement, aligns with pretraining objectives, and avoids the limitations of MDP-based methods. V-GRPO achieves state-of-the-art performance in text-to-image synthesis, while delivering a $2\times$ speedup over MixGRPO and a $3\times$ speedup over DiffusionNFT.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces V-GRPO, which integrates ELBO-based likelihood surrogates into the Group Relative Policy Optimization (GRPO) algorithm for online RL on denoising generative models. The central technical contribution is a set of variance-reduction and gradient-step-control techniques derived from the diffusion ELBO that stabilize training; the authors claim this yields both stability and efficiency gains, outperforming MDP-based baselines (MixGRPO, DiffusionNFT) on text-to-image synthesis while delivering 2× and 3× speedups under matched compute budgets.
Significance. If the reported results hold, the work is significant because it demonstrates that a properly stabilized ELBO surrogate can surpass the stability of MDP formulations without incurring their trajectory-sampling overhead, while remaining aligned with pretraining objectives and easy to implement. The manuscript supplies ablation tables that isolate the contribution of each control, conducts experiments with multiple seeds and fixed hyperparameters across runs, and measures speedups under matched compute budgets; these are concrete strengths that support the empirical claims.
Minor comments (3)
- [Abstract] The claim of "state-of-the-art performance" is not accompanied by the specific metrics (e.g., FID, CLIP score) or the full set of baselines against which superiority is asserted; adding this information would strengthen the abstract.
- [§4, Experiments] While the text states that speedups are measured under matched compute budgets, the precise accounting (wall-clock time, number of denoising steps, hardware) is not tabulated; a small table or paragraph clarifying this would improve reproducibility.
- [§3.2] The variance-reduction term is introduced without an explicit equation reference back to the ELBO derivation; a single cross-reference would clarify the connection for readers.
Simulated Author's Rebuttal
We thank the referee for the positive and accurate summary of our work, as well as the recommendation for minor revision. The referee correctly identifies the core technical contribution of V-GRPO in stabilizing ELBO-based surrogates within GRPO for online RL on denoising models, and we appreciate the recognition of our empirical strengths including ablations, multi-seed experiments, and matched-compute speedups.
Circularity Check
No significant circularity detected
Full rationale
The derivation starts from the standard diffusion ELBO and GRPO objective, then introduces explicit variance-reduction and per-step gradient controls that are algebraically derived from those starting points rather than fitted to the target metric. Ablation tables isolate each control's contribution, and reported speedups are measured on matched compute budgets against external MDP baselines. No step reduces a claimed prediction to a quantity defined by the method itself, nor does any load-bearing claim rest on a self-citation whose validity is presupposed. The argument therefore remains self-contained against external benchmarks.
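For readers who want the phrase "per-step gradient controls" pinned down, one standard device in the PPO/GRPO family is ratio clipping, which GRPO inherits from PPO [33]. Written with the ELBO standing in for the intractable log-likelihood, it takes roughly the form below; this illustrates the family of controls, not necessarily the paper's exact mechanism.

```latex
% Likelihood ratio approximated through the ELBO surrogate:
\rho_i(\theta) \;=\; \exp\!\big(\mathrm{ELBO}_\theta(x_i) - \mathrm{ELBO}_{\theta_{\mathrm{old}}}(x_i)\big)
\;\approx\; \frac{p_\theta(x_i)}{p_{\theta_{\mathrm{old}}}(x_i)},

% Clipped objective: updates that move the ratio outside [1-\varepsilon, 1+\varepsilon]
% receive no further gradient, bounding each policy step.
J(\theta) \;=\; \mathbb{E}_i\!\left[\min\!\big(\rho_i(\theta)\,\hat{A}_i,\;
  \operatorname{clip}(\rho_i(\theta),\,1-\varepsilon,\,1+\varepsilon)\,\hat{A}_i\big)\right].
```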
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
- [1] Kevin Black, Michael Janner, Yilun Du, Ilya Kostrikov, and Sergey Levine. Training diffusion models with reinforcement learning. arXiv:2305.13301, 2023.
- [2] Cheng Cui, Ting Sun, Manhui Lin, Tingquan Gao, Yubo Zhang, Jiaxuan Liu, Xueqing Wang, Zelun Zhang, Changda Zhou, Hongen Liu, et al. PaddleOCR 3.0 technical report. arXiv:2507.05595, 2025.
- [3] Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis. ICML, 2024.
- [4] Ying Fan and Kangwook Lee. Optimizing DDPM sampling with shortcut fine-tuning. arXiv:2301.13362, 2023.
- [5] Ying Fan, Olivia Watkins, Yuqing Du, Hao Liu, Moonkyung Ryu, Craig Boutilier, Pieter Abbeel, Mohammad Ghavamzadeh, Kangwook Lee, and Kimin Lee. DPOK: Reinforcement learning for fine-tuning text-to-image diffusion models. NeurIPS, 2023.
- [6] Dhruba Ghosh, Hannaneh Hajishirzi, and Ludwig Schmidt. GenEval: An object-focused framework for evaluating text-to-image alignment. NeurIPS, 2023.
- [7] Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv:2501.12948, 2025.
- [8] Xiaoxuan He, Siming Fu, Yuke Zhao, Wanli Li, Jian Yang, Dacheng Yin, Fengyun Rao, and Bo Zhang. TempFlow-GRPO: When timing matters for GRPO in flow models. arXiv:2508.04324, 2025.
- [9] Jack Hessel, Ari Holtzman, Maxwell Forbes, Ronan Le Bras, and Yejin Choi. CLIPScore: A reference-free evaluation metric for image captioning. EMNLP, 2021.
- [10] Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. arXiv:2207.12598, 2022.
- [11] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. NeurIPS, 2020.
- [12] Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. LoRA: Low-rank adaptation of large language models. ICLR, 2022.
- [13] Diederik Kingma and Ruiqi Gao. Understanding diffusion objectives as the ELBO with simple data augmentation. NeurIPS, 2023.
- [14] Diederik Kingma, Tim Salimans, Ben Poole, and Jonathan Ho. Variational diffusion models. NeurIPS, 2021.
- [15] Yuval Kirstain, Adam Polyak, Uriel Singer, Shahbuland Matiana, Joe Penna, and Omer Levy. Pick-a-Pic: An open dataset of user preferences for text-to-image generation. NeurIPS, 2023.
- [16] Black Forest Labs. FLUX. https://github.com/black-forest-labs/flux, 2024.
- [17] Chieh-Hsin Lai, Yang Song, Dongjun Kim, Yuki Mitsufuji, and Stefano Ermon. The principles of diffusion models. arXiv:2510.21890, 2025.
- [18] Kimin Lee, Hao Liu, Moonkyung Ryu, Olivia Watkins, Yuqing Du, Craig Boutilier, Pieter Abbeel, Mohammad Ghavamzadeh, and Shixiang Shane Gu. Aligning text-to-image models using human feedback. arXiv:2302.12192, 2023.
- [19] Junzhe Li, Yutao Cui, Tao Huang, Yinping Ma, Chun Fan, Miles Yang, and Zhao Zhong. MixGRPO: Unlocking flow-based GRPO efficiency with mixed ODE-SDE. arXiv:2507.21802, 2025.
- [20] Tianhong Li and Kaiming He. Back to basics: Let denoising generative models denoise. arXiv:2511.13720, 2025.
- [21] Yuming Li, Yikai Wang, Yuying Zhu, Zhongyu Zhao, Ming Lu, Qi She, and Shanghang Zhang. BranchGRPO: Stable and efficient GRPO with structured branching in diffusion models. arXiv:2509.06040, 2025.
- [22] Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling. arXiv:2210.02747, 2022.
- [23] Jie Liu, Gongye Liu, Jiajun Liang, Yangguang Li, Jiaheng Liu, Xintao Wang, Pengfei Wan, Di Zhang, and Wanli Ouyang. Flow-GRPO: Training flow matching models via online RL. arXiv:2505.05470, 2025.
- [24] Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow. arXiv:2209.03003, 2022.
- [25] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv:1711.05101, 2017.
- [26] Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu. DPM-Solver++: Fast solver for guided sampling of diffusion probabilistic models. Machine Intelligence Research, pages 1–22, 2025.
- [27] David McAllister, Songwei Ge, Brent Yi, Chung Min Kim, Ethan Weber, Hongsuk Choi, Haiwen Feng, and Angjoo Kanazawa. Flow matching policy gradients. arXiv:2507.21053, 2025.
- [28] Bernt Øksendal. Stochastic Differential Equations: An Introduction with Applications. Springer, 2003.
- [29] Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. NeurIPS, 2023.
- [30] Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, et al. Photorealistic text-to-image diffusion models with deep language understanding. NeurIPS, 2022.
- [31] Tim Salimans and Jonathan Ho. Progressive distillation for fast sampling of diffusion models. arXiv:2202.00512, 2022.
- [32] Christoph Schuhmann. LAION-Aesthetics. https://laion.ai/blog/laion-aesthetics/, 2022.
- [33] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv:1707.06347, 2017.
- [34] Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv:2402.03300, 2024.
- [35] Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. ICML, 2015.
- [36] Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. arXiv:2011.13456, 2020.
- [37] Yang Song, Conor Durkan, Iain Murray, and Stefano Ermon. Maximum likelihood training of score-based diffusion models. NeurIPS, 2021.
- [38] Bram Wallace, Meihua Dang, Rafael Rafailov, Linqi Zhou, Aaron Lou, Senthil Purushwalkam, Stefano Ermon, Caiming Xiong, Shafiq Joty, and Nikhil Naik. Diffusion model alignment using direct preference optimization. CVPR, 2024.
- [39] Feng Wang and Zihao Yu. Coefficients-preserving sampling for reinforcement learning with flow matching. arXiv:2509.05952, 2025.
- [40] Jing Wang, Jiajun Liang, Jie Liu, Henglin Liu, Gongye Liu, Jun Zheng, Wanyuan Pang, Ao Ma, Zhenyu Xie, Xintao Wang, et al. GRPO-Guard: Mitigating implicit over-optimization in flow matching via regulated clipping. arXiv:2510.22319, 2025.
- [41] Yibin Wang, Yuhang Zang, Hao Li, Cheng Jin, and Jiaqi Wang. Unified reward model for multimodal understanding and generation. arXiv:2503.05236, 2025.
- [42] Xiaoshi Wu, Yiming Hao, Keqiang Sun, Yixiong Chen, Feng Zhu, Rui Zhao, and Hongsheng Li. Human Preference Score v2: A solid benchmark for evaluating human preferences of text-to-image synthesis. arXiv:2306.09341, 2023.
- [43] Jiazheng Xu, Xiao Liu, Yuchen Wu, Yuxuan Tong, Qinkai Li, Ming Ding, Jie Tang, and Yuxiao Dong. ImageReward: Learning and evaluating human preferences for text-to-image generation. NeurIPS, 2023.
- [44] Shuchen Xue, Chongjian Ge, Shilong Zhang, Yichen Li, and Zhi-Ming Ma. Advantage weighted matching: Aligning RL with pretraining in diffusion models. arXiv:2509.25050, 2025.
- [45] Zeyue Xue, Jie Wu, Yu Gao, Fangyuan Kong, Lingting Zhu, Mengzhao Chen, Zhiheng Liu, Wei Liu, Qiushan Guo, Weilin Huang, et al. DanceGRPO: Unleashing GRPO on visual generation. arXiv:2505.07818, 2025.
- [46] Tianwei Yin, Michaël Gharbi, Richard Zhang, Eli Shechtman, Fredo Durand, William T Freeman, and Taesung Park. One-step diffusion with distribution matching distillation. CVPR, 2024.
- [47] Kaiwen Zheng, Huayu Chen, Haotian Ye, Haoxiang Wang, Qinsheng Zhang, Kai Jiang, Hang Su, Stefano Ermon, Jun Zhu, and Ming-Yu Liu. DiffusionNFT: Online diffusion reinforcement with forward process. arXiv:2509.16117, 2025.

Appendix excerpt (A. Additional Implementation Details)
Our implementation adheres closely to the baseline methods, with deviations limited ... and the AdamW [25] optimizer with a learning rate of $3\times 10^{-4}$ and a weight decay of $1\times 10^{-4}$. Across all stages, training is conducted with a global batch size of 48 per gradient step and a group size of 24, matching the per-step configuration of DiffusionNFT [47]. For consistency with DiffusionNFT, we prioritize using fully on-policy training with adv...
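The quoted hyperparameters translate directly into a few lines of setup. A minimal sketch in PyTorch, assuming a generic nn.Module policy; only the values stated in the excerpt are used.

```python
import torch

def make_optimizer(model: torch.nn.Module) -> torch.optim.AdamW:
    # AdamW [25] with the learning rate and weight decay quoted above.
    return torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=1e-4)

GLOBAL_BATCH_SIZE = 48  # samples per gradient step, as stated in the excerpt
GROUP_SIZE = 24         # group size, matching DiffusionNFT's [47] per-step setup
```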