arxiv: 2604.23380 · v1 · submitted 2026-04-25 · 💻 cs.LG · cs.CV

Recognition: unknown

V-GRPO: Online Reinforcement Learning for Denoising Generative Models Is Easier than You Think

Bingda Tang , Yuhui Zhang , Xiaohan Wang , Jiayuan Mao , Ludwig Schmidt , Serena Yeung-Levy

Authors on Pith no claims yet

Pith reviewed 2026-05-08 08:23 UTC · model grok-4.3

classification 💻 cs.LG cs.CV

keywords reinforcement learningdenoising modelsELBOdiffusion modelstext-to-image synthesisvariance reductionpolicy optimizationGRPO

0 comments

The pith

By reducing variance in the ELBO surrogate and controlling gradient steps, online reinforcement learning for denoising generative models becomes stable, efficient, and superior to MDP-based alternatives.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Denoising generative models such as diffusion models used in text-to-image tasks have intractable likelihoods, which blocks straightforward application of online RL for alignment with rewards or preferences. Earlier attempts either converted the problem into an MDP over full sampling trajectories, which proved stable but slow, or relied on ELBO-based likelihood surrogates that suffered from high variance and instability. The paper shows that simple variance reduction combined with controlled gradient steps turns the ELBO surrogate into a workable and efficient estimator. When this estimator is paired with the Group Relative Policy Optimization algorithm, the resulting V-GRPO method delivers higher performance than MDP baselines while cutting training time substantially. A sympathetic reader cares because the approach keeps the optimization close to the model's original pretraining objective and removes the need for expensive trajectory sampling.

Core claim

V-GRPO shows that an ELBO-based surrogate for policy gradients, once stabilized through variance reduction and step-size control, can be integrated with GRPO to produce stable online RL updates for denoising generative models; this yields state-of-the-art text-to-image results together with a 2x speedup over MixGRPO and a 3x speedup over DiffusionNFT.

What carries the argument

V-GRPO, the algorithm that pairs reduced-variance ELBO likelihood surrogates with Group Relative Policy Optimization and explicit gradient control to enable direct, stable policy updates on denoising models.

Load-bearing premise

The proposed variance reduction and gradient control steps will keep the ELBO surrogate stable across different model architectures and reward signals without introducing fresh instabilities or demanding heavy per-task tuning.

What would settle it

Apply V-GRPO to a standard text-to-image diffusion model, compare reward scores and wall-clock training time against MixGRPO and DiffusionNFT on the same benchmark; if the method fails to match the reported performance gains or speedups, the central claim does not hold.

Figures

Figures reproduced from arXiv: 2604.23380 by Bingda Tang, Jiayuan Mao, Ludwig Schmidt, Serena Yeung-Levy, Xiaohan Wang, Yuhui Zhang.

**Figure 1.** Figure 1: Per-sample loss varies substantially across timesteps. Statistics are computed over 400 samples using Eq. (10). Shaded regions indicate ±1 standard deviation. that could fail to faithfully reflect the relative likelihoods. This problem is further compounded by the observation in view at source ↗

**Figure 2.** Figure 2: Gradient norms scale with surrogate magnitude. Statistics are computed over ∼20K samples using FLUX.1-dev. Gradient norms are computed by backpropagating through the importance ratios, without applying clipping or scaling by the advantages. Mean curves are truncated to the 1st-99th percentile range. Group-shared timestep-noise pairs. To reduce variance in the surrogate magnitude arising from random draws… view at source ↗

**Figure 3.** Figure 3: Qualitative comparison from the FLUX.1-dev main experiments. V-GRPO achieves superior performance in alignment, coherence, and style. In the second example, it shows strong text rendering capabilities without leveraging task-specific rewards or datasets. As shown in Tab. 4, KL penalty plays a key role in preserving capabilities acquired during earlier training. However, on its own it is not always suffic… view at source ↗

**Figure 4.** Figure 4: Results for ablation studies. Moreover, V-GRPO achieves competitive performance in single-reward settings. Results are reported in Tab. 6. 5.4. Ablation Studies Reducing surrogate variance. Fig. 4a ablates the proposed surrogate variance reduction techniques on FLUX.1- dev [16]. Without these methods, the naive baseline suffers from severe training instability. Removing either groupshared timestep-noise … view at source ↗

**Figure 5.** Figure 5: Ablation studies of surrogate variance reduction techniques. Implementation details follow those of Stage-1 training in the SD 3.5 main experiments. 0 50 100 150 200 250 300 Training Iteration 1.30 1.35 1.40 1.45 1.50 1.55 1.60 Mean Reward x-prediction -prediction v-prediction view at source ↗

**Figure 6.** Figure 6: Ablation studies of alternative reparameterizations of model predictions. Implementation details follow those used in the FLUX.1-dev experiments view at source ↗

**Figure 7.** Figure 7: Qualitative comparison from the FLUX.1-dev main experiments. V-GRPO achieves superior performance in alignment, coherence, and style. In the fourth example, it demonstrates strong world knowledge view at source ↗

**Figure 8.** Figure 8: Qualitative comparison from the SD 3.5 M main experiments. V-GRPO achieves superior performance in alignment, coherence, and style view at source ↗

read the original abstract

Aligning denoising generative models with human preferences or verifiable rewards remains a key challenge. While policy-gradient online reinforcement learning (RL) offers a principled post-training framework, its direct application is hindered by the intractable likelihoods of these models. Prior work therefore either optimizes an induced Markov decision process (MDP) over sampling trajectories, which is stable but inefficient, or uses likelihood surrogates based on the diffusion evidence lower bound (ELBO), which have so far underperformed on visual generation. Our key insight is that the ELBO-based approach can, in fact, be made both stable and efficient. By reducing surrogate variance and controlling gradient steps, we show that this approach can beat MDP-based methods. To this end, we introduce Variational GRPO (V-GRPO), a method that integrates ELBO-based surrogates with the Group Relative Policy Optimization (GRPO) algorithm, alongside a set of simple yet essential techniques. Our method is easy to implement, aligns with pretraining objectives, and avoids the limitations of MDP-based methods. V-GRPO achieves state-of-the-art performance in text-to-image synthesis, while delivering a $2\times$ speedup over MixGRPO and a $3\times$ speedup over DiffusionNFT.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 3 minor

Summary. The manuscript introduces V-GRPO, which integrates ELBO-based likelihood surrogates into the Group Relative Policy Optimization (GRPO) algorithm for online RL on denoising generative models. The central technical contribution is a set of variance-reduction and gradient-step-control techniques derived from the diffusion ELBO that stabilize training; the authors claim this yields both stability and efficiency gains, outperforming MDP-based baselines (MixGRPO, DiffusionNFT) on text-to-image synthesis while delivering 2× and 3× speedups under matched compute budgets.

Significance. If the reported results hold, the work is significant because it demonstrates that a properly stabilized ELBO surrogate can surpass the stability of MDP formulations without incurring their trajectory-sampling overhead, while remaining aligned with pretraining objectives and easy to implement. The manuscript supplies ablation tables that isolate the contribution of each control, conducts experiments with multiple seeds and fixed hyperparameters across runs, and measures speedups under matched compute budgets; these are concrete strengths that support the empirical claims.

minor comments (3)

[Abstract] Abstract: the claim of 'state-of-the-art performance' is not accompanied by the specific metrics (e.g., FID, CLIP score) or the full set of baselines against which superiority is asserted; adding this information would strengthen the abstract.
[§4] §4 (Experiments): while the text states that speedups are measured under matched compute budgets, the precise accounting (wall-clock time, number of denoising steps, hardware) is not tabulated; a small table or paragraph clarifying this would improve reproducibility.
[§3.2] Notation in §3.2: the variance-reduction term is introduced without an explicit equation reference back to the ELBO derivation; a single cross-reference would clarify the connection for readers.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive and accurate summary of our work, as well as the recommendation for minor revision. The referee correctly identifies the core technical contribution of V-GRPO in stabilizing ELBO-based surrogates within GRPO for online RL on denoising models, and we appreciate the recognition of our empirical strengths including ablations, multi-seed experiments, and matched-compute speedups.

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The derivation starts from the standard diffusion ELBO and GRPO objective, then introduces explicit variance-reduction and per-step gradient controls that are algebraically derived from those starting points rather than fitted to the target metric. Ablation tables isolate each control's contribution, and reported speedups are measured on matched compute budgets against external MDP baselines. No step reduces a claimed prediction to a quantity defined by the method itself, nor does any load-bearing claim rest on a self-citation whose validity is presupposed. The argument therefore remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no explicit free parameters, axioms, or invented entities; the method appears to rely on standard diffusion and RL assumptions whose details are not visible here.

pith-pipeline@v0.9.0 · 5540 in / 981 out tokens · 43818 ms · 2026-05-08T08:23:08.145041+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

48 extracted references · 28 canonical work pages · 19 internal anchors

[1]

Training Diffusion Models with Reinforcement Learning

Kevin Black, Michael Janner, Yilun Du, Ilya Kostrikov, and Sergey Levine. Training diffusion models with reinforce- ment learning.arXiv:2305.13301, 2023. 1, 2, 3, 4

work page internal anchor Pith review arXiv 2023
[2]

PaddleOCR 3.0 Technical Report

Cheng Cui, Ting Sun, Manhui Lin, Tingquan Gao, Yubo Zhang, Jiaxuan Liu, Xueqing Wang, Zelun Zhang, Changda Zhou, Hongen Liu, et al. Paddleocr 3.0 technical report. arXiv:2507.05595, 2025. 1, 7, 8, 11

work page internal anchor Pith review arXiv 2025
[3]

Scaling recti- fied flow transformers for high-resolution image synthesis

Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas M ¨uller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling recti- fied flow transformers for high-resolution image synthesis. ICML, 2024. 7, 8, 12

2024
[4]

Optimizing ddpm sampling with shortcut fine-tuning.arXiv preprint arXiv:2301.13362,

Ying Fan and Kangwook Lee. Optimizing ddpm sampling with shortcut fine-tuning.arXiv:2301.13362, 2023. 1, 2

work page arXiv 2023
[5]

Dpok: Reinforcement learning for fine-tuning text-to-image diffu- sion models.NeurIPS, 2023

Ying Fan, Olivia Watkins, Yuqing Du, Hao Liu, Moonkyung Ryu, Craig Boutilier, Pieter Abbeel, Moham- mad Ghavamzadeh, Kangwook Lee, and Kimin Lee. Dpok: Reinforcement learning for fine-tuning text-to-image diffu- sion models.NeurIPS, 2023. 1, 2, 3

2023
[6]

Geneval: An object-focused framework for evaluating text- to-image alignment.NeurIPS, 2023

Dhruba Ghosh, Hannaneh Hajishirzi, and Ludwig Schmidt. Geneval: An object-focused framework for evaluating text- to-image alignment.NeurIPS, 2023. 1, 6, 7, 8, 11

2023
[7]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentiviz- ing reasoning capability in llms via reinforcement learning. arXiv:2501.12948, 2025. 1, 3

work page internal anchor Pith review arXiv 2025
[8]

Tempflow-grpo: When timing matters for grpo in flow models.arXiv preprint arXiv:2508.04324,

Xiaoxuan He, Siming Fu, Yuke Zhao, Wanli Li, Jian Yang, Dacheng Yin, Fengyun Rao, and Bo Zhang. Tempflow- grpo: When timing matters for grpo in flow models. arXiv:2508.04324, 2025. 2

work page arXiv 2025
[9]

Clipscore: A reference-free evaluation met- ric for image captioning

Jack Hessel, Ari Holtzman, Maxwell Forbes, Ronan Le Bras, and Yejin Choi. Clipscore: A reference-free evaluation met- ric for image captioning. InEMNLP, 2021. 7, 11

2021
[10]

Classifier-Free Diffusion Guidance

Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance.arXiv:2207.12598, 2022. 7

work page internal anchor Pith review arXiv 2022
[11]

Denoising diffu- sion probabilistic models.NeurIPS, 2020

Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffu- sion probabilistic models.NeurIPS, 2020. 1, 2, 3

2020
[12]

Lora: Low-rank adaptation of large language models

Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen- Zhu, Yuanzhi Li, Shean Wang, Liang Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models. ICLR, 2022. 11

2022
[13]

Understanding diffu- sion objectives as the elbo with simple data augmentation

Diederik Kingma and Ruiqi Gao. Understanding diffu- sion objectives as the elbo with simple data augmentation. NeurIPS, 2023. 3, 4

2023
[14]

Variational diffusion models.NeurIPS, 2021

Diederik Kingma, Tim Salimans, Ben Poole, and Jonathan Ho. Variational diffusion models.NeurIPS, 2021. 2, 3

2021
[15]

Pick-a-pic: An open dataset of user preferences for text-to-image generation

Yuval Kirstain, Adam Polyak, Uriel Singer, Shahbuland Ma- tiana, Joe Penna, and Omer Levy. Pick-a-pic: An open dataset of user preferences for text-to-image generation. NeurIPS, 2023. 1, 7, 11

2023
[16]

Flux.https://github.com/ black-forest-labs/flux, 2024

Black Forest Labs. Flux.https://github.com/ black-forest-labs/flux, 2024. 6, 7, 8, 12

2024
[17]

The Principles of Diffusion Models,

Chieh-Hsin Lai, Yang Song, Dongjun Kim, Yuki Mitsufuji, and Stefano Ermon. The principles of diffusion models. arXiv:2510.21890, 2025. 4

work page arXiv 2025
[18]

Aligning Text-to-Image Models using Human Feedback

Kimin Lee, Hao Liu, Moonkyung Ryu, Olivia Watkins, Yuqing Du, Craig Boutilier, Pieter Abbeel, Mohammad Ghavamzadeh, and Shixiang Shane Gu. Aligning text-to- image models using human feedback.arXiv:2302.12192,

work page internal anchor Pith review arXiv
[19]

MixGRPO: Unlocking Flow-based GRPO Efficiency with Mixed ODE-SDE

Junzhe Li, Yutao Cui, Tao Huang, Yinping Ma, Chun Fan, Miles Yang, and Zhao Zhong. Mixgrpo: Un- locking flow-based grpo efficiency with mixed ode-sde. arXiv:2507.21802, 2025. 1, 2, 3, 6, 7, 8, 11

work page internal anchor Pith review arXiv 2025
[20]

Back to Basics: Let Denoising Generative Models Denoise

Tianhong Li and Kaiming He. Back to basics: Let denoising generative models denoise.arXiv preprint arXiv:2511.13720, 2025. 2

work page internal anchor Pith review arXiv 2025
[21]

arXiv preprint arXiv:2509.06040 (2025) 2, 3

Yuming Li, Yikai Wang, Yuying Zhu, Zhongyu Zhao, Ming Lu, Qi She, and Shanghang Zhang. Branchgrpo: Stable and efficient grpo with structured branching in diffusion models. arXiv:2509.06040, 2025. 1, 2, 3, 6, 7

work page arXiv 2025
[22]

Flow Matching for Generative Modeling

Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximil- ian Nickel, and Matt Le. Flow matching for generative mod- eling.arXiv:2210.02747, 2022. 1, 2

work page internal anchor Pith review arXiv 2022
[23]

Flow-GRPO: Training Flow Matching Models via Online RL

Jie Liu, Gongye Liu, Jiajun Liang, Yangguang Li, Jiaheng Liu, Xintao Wang, Pengfei Wan, Di Zhang, and Wanli Ouyang. Flow-grpo: Training flow matching models via on- line rl.arXiv:2505.05470, 2025. 1, 2, 3, 7, 11

work page internal anchor Pith review arXiv 2025
[24]

Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow

Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow.arXiv:2209.03003, 2022. 2, 4, 7

work page internal anchor Pith review arXiv 2022
[25]

Decoupled Weight Decay Regularization

Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization.arXiv:1711.05101, 2017. 11

work page internal anchor Pith review arXiv 2017
[26]

Dpm-solver++: Fast solver for guided sam- pling of diffusion probabilistic models.Machine Intelligence Research, pages 1–22, 2025

Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu. Dpm-solver++: Fast solver for guided sam- pling of diffusion probabilistic models.Machine Intelligence Research, pages 1–22, 2025. 7

2025
[27]

Flow matching policy gradients.arXiv preprint arXiv:2507.21053,

David McAllister, Songwei Ge, Brent Yi, Chung Min Kim, Ethan Weber, Hongsuk Choi, Haiwen Feng, and Angjoo Kanazawa. Flow matching policy gradients. arXiv:2507.21053, 2025. 2, 4

work page arXiv 2025
[28]

Stochastic differential equations

Bernt Øksendal. Stochastic differential equations. In Stochastic differential equations: an introduction with ap- plications. Springer, 2003. 4

2003
[29]

Direct preference optimization: Your language model is secretly a reward model.NeurIPS, 2023

Rafael Rafailov, Archit Sharma, Eric Mitchell, Christo- pher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model.NeurIPS, 2023. 2

2023
[30]

Photorealistic text-to-image diffusion models with deep language understanding.NeurIPS, 2022

Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, et al. Photorealistic text-to-image diffusion models with deep language understanding.NeurIPS, 2022. 7

2022
[31]

Progressive Distillation for Fast Sampling of Diffusion Models

Tim Salimans and Jonathan Ho. Progressive distillation for fast sampling of diffusion models.arXiv:2202.00512, 2022. 2

work page internal anchor Pith review arXiv 2022
[32]

Laion-aesthetics.https : / / laion.ai/blog/laion-aesthetics/, 2022

Christoph Schuhmann. Laion-aesthetics.https : / / laion.ai/blog/laion-aesthetics/, 2022. 7, 11

2022
[33]

Proximal Policy Optimization Algorithms

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Rad- ford, and Oleg Klimov. Proximal policy optimization algo- rithms.arXiv:1707.06347, 2017. 1, 3, 5

work page internal anchor Pith review arXiv 2017
[34]

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junx- iao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the lim- its of mathematical reasoning in open language models. arXiv:2402.03300, 2024. 1, 3

work page internal anchor Pith review arXiv 2024
[35]

Deep unsupervised learning using nonequilibrium thermodynamics.ICML, 2015

Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics.ICML, 2015. 1, 2, 3

2015
[36]

Score-Based Generative Modeling through Stochastic Differential Equations

Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Ab- hishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equa- tions.arXiv:2011.13456, 2020. 2

work page internal anchor Pith review arXiv 2011
[37]

Maximum likelihood training of score-based diffusion mod- els.NeurIPS, 2021

Yang Song, Conor Durkan, Iain Murray, and Stefano Ermon. Maximum likelihood training of score-based diffusion mod- els.NeurIPS, 2021. 2, 3

2021
[38]

Diffusion model align- ment using direct preference optimization.CVPR, 2024

Bram Wallace, Meihua Dang, Rafael Rafailov, Linqi Zhou, Aaron Lou, Senthil Purushwalkam, Stefano Ermon, Caiming Xiong, Shafiq Joty, and Nikhil Naik. Diffusion model align- ment using direct preference optimization.CVPR, 2024. 2, 5

2024
[39]

Coefficients-preserving sampling for reinforcement learning with flow matching

Feng Wang and Zihao Yu. Coefficients-preserving sam- pling for reinforcement learning with flow matching. arXiv:2509.05952, 2025. 2

work page arXiv 2025
[40]

GRPO-Guard: Mitigating implicit over-optimization in flow matching via regulated clipping, 2025

Jing Wang, Jiajun Liang, Jie Liu, Henglin Liu, Gongye Liu, Jun Zheng, Wanyuan Pang, Ao Ma, Zhenyu Xie, Xintao Wang, et al. Grpo-guard: Mitigating implicit over-optimization in flow matching via regulated clipping. arXiv:2510.22319, 2025. 2

work page arXiv 2025
[41]

Unified Reward Model for Multimodal Understanding and Generation

Yibin Wang, Yuhang Zang, Hao Li, Cheng Jin, and Jiaqi Wang. Unified reward model for multimodal understanding and generation.arXiv:2503.05236, 2025. 1, 7, 11

work page internal anchor Pith review arXiv 2025
[42]

Human Preference Score v2: A Solid Benchmark for Evaluating Human Preferences of Text-to-Image Synthesis

Xiaoshi Wu, Yiming Hao, Keqiang Sun, Yixiong Chen, Feng Zhu, Rui Zhao, and Hongsheng Li. Human preference score v2: A solid benchmark for evaluating human preferences of text-to-image synthesis.arXiv:2306.09341, 2023. 7, 11

work page internal anchor Pith review arXiv 2023
[43]

Imagere- ward: Learning and evaluating human preferences for text- to-image generation.NeurIPS, 2023

Jiazheng Xu, Xiao Liu, Yuchen Wu, Yuxuan Tong, Qinkai Li, Ming Ding, Jie Tang, and Yuxiao Dong. Imagere- ward: Learning and evaluating human preferences for text- to-image generation.NeurIPS, 2023. 1, 7, 11

2023
[44]

arXiv preprint arXiv:2509.25050 , year=

Shuchen Xue, Chongjian Ge, Shilong Zhang, Yichen Li, and Zhi-Ming Ma. Advantage weighted matching: Aligning rl with pretraining in diffusion models.arXiv:2509.25050,

work page arXiv
[45]

DanceGRPO: Unleashing GRPO on Visual Generation

Zeyue Xue, Jie Wu, Yu Gao, Fangyuan Kong, Lingting Zhu, Mengzhao Chen, Zhiheng Liu, Wei Liu, Qiushan Guo, Weilin Huang, et al. Dancegrpo: Unleashing grpo on visual generation.arXiv:2505.07818, 2025. 1, 2, 3, 7, 8

work page internal anchor Pith review arXiv 2025
[46]

One-step diffusion with distribution matching distillation

Tianwei Yin, Micha ¨el Gharbi, Richard Zhang, Eli Shecht- man, Fredo Durand, William T Freeman, and Taesung Park. One-step diffusion with distribution matching distillation. In CVPR, 2024. 5

2024
[47]

DiffusionNFT: Online Diffusion Reinforcement with Forward Process

Kaiwen Zheng, Huayu Chen, Haotian Ye, Haoxiang Wang, Qinsheng Zhang, Kai Jiang, Hang Su, Stefano Ermon, Jun Zhu, and Ming-Yu Liu. Diffusionnft: Online diffusion rein- forcement with forward process.arXiv:2509.16117, 2025. 2, 5, 7, 11 A. Additional Implementation Details Our implementation adheres closely to the baseline meth- ods, with deviations limited ...

work page internal anchor Pith review arXiv 2025
[48]

Shaun the Sheep

and the AdamW [25] optimizer with a learning rate of 3×10 −4 and a weight decay of1×10 −4. Across all stages, training is conducted with a global batch size of 48 per gra- dient step and a group size of 24, matching the per-step con- figuration of DiffusionNFT [47]. For consistency with DiffusionNFT, we prioritize us- ing fully on-policy training with adv...

work page arXiv 2051