{"total":17,"items":[{"citing_arxiv_id":"2606.27771","ref_index":32,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"NormGuard: Reward-Preserving Norm Constraints in Flow-Matching Reinforcement Learning","primary_cat":"cs.LG","submitted_at":"2026-06-26T06:56:54+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"NormGuard adds a training-time hinge penalty on velocity norm inflation in flow-matching RL to improve MLLM-judged image quality and forensic realism while preserving reward across multiple setups.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.27736","ref_index":83,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Explicit Critic Guidance for Aligning Diffusion Models","primary_cat":"cs.LG","submitted_at":"2026-05-26T22:20:51+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Introduces a state-aligned latent actor-critic framework that lets diffusion models act as their own timestep-conditioned value functions for trajectory-level RL post-training and inference steering.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.26552","ref_index":104,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Aligning Few-Step Generative Models by Amortizing Sample-based Variational Inference","primary_cat":"cs.LG","submitted_at":"2026-05-26T05:02:49+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"FAV aligns few-step generative models by amortizing SVGD updates from reward-tilted sampling into generator parameters via fixed-point regression, requiring only sample access, and shows outperformance on robotics tasks plus scaling on image 
generators.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.26108","ref_index":80,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Reinforcing Few-step Generators via Reward-Tilted Distribution Matching","primary_cat":"cs.CV","submitted_at":"2026-05-25T17:59:21+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"RTDMD unifies KL minimization to a reward-tilted teacher into distribution matching plus reward terms, using AC-DMD in stage one and hybrid GRPO-style gradients plus SubGRPO in stage two to reach new SOTA on preference, aesthetic, and compositional metrics with 4-step generation on SD3, SD3.5, and F","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.26013","ref_index":13,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"AdvantageFlow: Advantage-Weighted Least Squares for RL in Flow Models","primary_cat":"cs.LG","submitted_at":"2026-05-25T16:32:14+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"AdvantageFlow proposes an advantage-weighted forward-process least-squares loss for RL in rectified flow models, stabilized by rollout policy regularization, and reports better image generation performance than Flow-GRPO on Stable Diffusion 3.5.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.21573","ref_index":56,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Lens: Rethinking Training Efficiency for Foundational Text-to-Image Models","primary_cat":"cs.CV","submitted_at":"2026-05-20T17:59:58+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Lens is a 3.8B-parameter 
text-to-image model that reaches competitive or superior performance to >6B-parameter systems using 19.3% of the training compute of Z-Image through a densely captioned 800M dataset, multi-resolution batching, semantic VAE, strong language encoder, RL fine-tuning, and 4-step","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.11480","ref_index":38,"ref_count":2,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Efficient Adjoint Matching for Fine-tuning Diffusion Models","primary_cat":"cs.LG","submitted_at":"2026-05-12T03:55:12+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"EAM reformulates adjoint matching for diffusion fine-tuning with linear base drift to allow efficient deterministic sampling and closed-form adjoints while matching or exceeding prior performance.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"tuning text-to-image diffusion models. In ECCV, 2024. [36] X. Wu, K. Sun, F. Zhu, R. Zhao, and H. Li. Human preference score: Better aligning text-to-image models with human preference. In ICCV, 2023. [37] J. Xu, X. Liu, Y. Wu, Y. Tong, Q. Li, M. Ding, J. Tang, and Y. Dong. ImageReward: Learning and evaluating human preferences for text-to-image generation. In NeurIPS, 2023. [38] S. Xue, C. Ge, S. Zhang, Y. Li, and Z.-M. Ma. Advantage weighted matching: Aligning rl with pretraining in diffusion models. arXiv:2509.25050, 2025. [39] Z. Xue, J. Wu, Y. Gao, F. Kong, L. Zhu, M. Chen, Z. Liu, W. Liu, Q. Guo, W. Huang, and P. Luo. DanceGRPO: Unleashing grpo on visual generation. arXiv:2505.07818, 2025. [40] H. Ye, K. Zheng, J. 
Xu, P."},{"citing_arxiv_id":"2605.10759","ref_index":11,"ref_count":2,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Reinforce Adjoint Matching: Scaling RL Post-Training of Diffusion and Flow-Matching Models","primary_cat":"cs.LG","submitted_at":"2026-05-11T15:56:29+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Derives RAM, a reward-adjusted consistency loss extending diffusion pretraining regression to efficient KL-regularized RL post-training, achieving peak rewards up to 50x faster than Flow-GRPO on Stable Diffusion 3.5M.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Random jump endpoint + analytic noising exact at init./optimum high variance Full-horizon w/ Bayes SDE rollout exact at init./optimum pathwise VJPs Full-horizon w/ Malliavin SDE rollout exact pathwise + score VJPs Local SDE rollout exact high variance Table 1Estimators for the value-function gradientA u t. Endpoint sampling with a random jump.We insert a uniformly random intermediate time s∈[0,t )into the analytic noising trajectory. Thent 2∥us(Xs)∥2 is an unbiased estimate of the integrated path cost. Combined with the Bayes bridge score from intermediate to training state, this yields the jump estimator ˆAjump t = ( r(X0)−t 2∥us(Xs)∥2) ∇xtlogps|t(Xs|xt) ⏐⏐ xt=Xt , s∼U[0,t).(21) The estimator is cheap and exact at initialization and at the optimum (Section C.1). The price is variance: the entire gradient signal flows through that one scalar prefactor. Pathwise differentiation along an SDE rollout.We simulate the controlled SDE (8) and recover the path-cost gradient by integrating an adjoint ODE backward along the stored trajectory. This is functionally equivalent to differentiating through the SDE solver in autograd, but more memory efficient. Adjoint integration over the full horizon is expensive and numerically delicate. 
Dynamic programming lets us replace some of the pathwise integration with a REINFORCE term: at any intermediate times, the adjoint splits into a REINFORCE term over the prefix[0,s]and a pathwise gradient over the suffix[s,t]. Theorem 5.1(Generalized adjoint).For every0≤s<t≤1and admissible controlu, ∇xVu t (x) =E [ Vu s (Xu s )∇xlogpu s|t(Xu s |x) ⏐⏐⏐Xu t =x ]    REINFORCE on prefix −1 2 E [ ∇x ∫ t s ∥uτ(Xu τ)∥2 dτ ⏐⏐⏐⏐Xu t =x ]    pathwise suffix .(22) The proof differentiates the recursion inxand applies the log-derivative identity to the prefix term (Section C."},{"citing_arxiv_id":"2604.25427","ref_index":30,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"A Systematic Post-Train Framework for Video Generation","primary_cat":"cs.CV","submitted_at":"2026-04-28T09:34:51+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"A post-training pipeline for video generation models combines SFT, RLHF with novel GRPO, prompt enhancement, and inference optimization to improve visual quality, temporal coherence, and instruction following.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Branchgrpo: Stable and efficient grpo with structured branching in diffusion models.arXiv preprint arXiv:2509.06040, 2025. [29] Kaiwen Zheng, Huayu Chen, Haotian Ye, Haoxiang Wang, Qinsheng Zhang, Kai Jiang, Hang Su, Stefano Ermon, Jun Zhu, and Ming-Yu Liu. Diffusionnft: Online diffusion reinforcement with forward process.arXiv preprint arXiv:2509.16117, 2025. [30] Shuchen Xue, Chongjian Ge, Shilong Zhang, Yichen Li, and Zhi-Ming Ma. Advantage weighted matching: Aligning rl with pretraining in diffusion models.arXiv preprint arXiv:2509.25050, 2025. [31] Songchun Zhang, Zeyue Xue, Siming Fu, Jie Huang, Xianghao Kong, Y Ma, Haoyang Huang, Nan Duan, and Anyi Rao. 
Astrolabe: Steering forward-process reinforcement learning for"},{"citing_arxiv_id":"2604.23380","ref_index":44,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"V-GRPO: Online Reinforcement Learning for Denoising Generative Models Is Easier than You Think","primary_cat":"cs.LG","submitted_at":"2026-04-25T17:03:21+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"V-GRPO makes ELBO surrogates stable and efficient for online RL alignment of denoising models, delivering SOTA text-to-image performance with 2-3x speedups over MixGRPO and DiffusionNFT.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"derperform on visual generation tasks. In this work, we revisit this simple approach and demonstrate that this limitation is not fundamental: a set of simple yet effective techniques unlocks its full potential, achieving state-of-the-art performance with significantly improved training efficiency. Concurrent with our work, Advantage Weighted Matching (AWM) [44] also explores ELBO-based surrogates and demonstrates their underexplored potential, yet our work offers a more comprehensive study with stronger empirical validation. 
To circumvent likelihood approximation altogether, DiffusionNFT [47] foregoes standard policy gradient framework in favor of contrasting positive and negative policies, achieving impressive results."},{"citing_arxiv_id":"2604.19009","ref_index":55,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Guiding Distribution Matching Distillation with Gradient-Based Reinforcement Learning","primary_cat":"cs.LG","submitted_at":"2026-04-21T02:57:13+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"GDMD replaces raw-sample rewards with distillation-gradient rewards in RL-guided diffusion distillation, yielding 4-step models that surpass their multi-step teachers on GenEval and human preference metrics.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"nary differential equation (ODE) sampling into stochastic differential equation (SDE) formulations. This modification introduces exploratory noise that facilitates group-wise policy optimization. Subsequent works [11,12,21,22,51] have further refined GRPO-based frameworks to enhance both training efficiency and stability. Despite these advances, recent studies [23,55,59] have identified limitations inherent in policy optimization methods that rely on likelihood estimation, such as systematic bias and restricted solver flexibility. 
In response, DiffusionNFT [59] proposes a novel approach that integrates reinforcement signals directly into the standard diffusion training objective, bypassing the need for explicit likelihood estimation or SDE-based reverse processes."},{"citing_arxiv_id":"2604.17415","ref_index":51,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Reward Score Matching: Unifying Reward-based Fine-tuning for Flow and Diffusion Models","primary_cat":"cs.LG","submitted_at":"2026-04-19T12:47:52+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.15311","ref_index":54,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"LeapAlign: Post-Training Flow Matching Models at Any Generation Step by Building Two-Step Trajectories","primary_cat":"cs.CV","submitted_at":"2026-04-16T17:59:56+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"LeapAlign fine-tunes flow matching models by constructing two consecutive leaps that skip multiple ODE steps with randomized timesteps and consistency weighting, enabling stable updates at any generation step.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"optimization (DPO) [37] for LLM post-training. They include Diffusion-DPO [46], D3PO [56], SPO [25], and others [2, 15, 16, 23, 45, 57, 59-61]. They fine-tune diffusion models using preference pairs or sets. For flow matching models, Adjoint Matching [5] formulates reward fine-tuning as stochastic optimal control, whereas DiffusionNFT [64] and AWM [54] propose forward-process RL methods. 
DanceGRPO [55] and Flow-GRPO [29] adapt GRPO [42] to flow matching by converting deterministic ODE sampling into an equivalent SDE formulation and applying the GRPO loss across generation steps. MixGRPO [22] and other GRPO variants [24, 47, 66] further improve efficiency and performance. Unlike the methods above, direct-gradient methods use the differentiability of diffusion and flow matching"},{"citing_arxiv_id":"2604.06916","ref_index":9,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"FP4 Explore, BF16 Train: Diffusion Reinforcement Learning via Efficient Rollout Scaling","primary_cat":"cs.LG","submitted_at":"2026-04-08T10:14:47+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Sol-RL decouples FP4-based candidate exploration from BF16 policy optimization in diffusion RL, delivering up to 4.64x faster convergence with maintained or superior alignment performance on models like FLUX.1 and SD3.5.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"gap: trajectories sampled by quantized policy exhibit an inherent distribution shift from the high-precision target policy, potentially disrupting delicate policy updates. Furthermore, for diffusion models post-training, the continuous nature of the state space exacerbates this degradation. Mainstream \"forward-process\" diffusion RL algorithms-especially Advantage Weighted Matching (AWM) [ 9], as well as DiffusionNFT [8]-formulate their objectives based on denoising score matching loss, treating the rollout samples as direct regression targets. As shown in Figure 3b, when corrupted by low-bit quantization (e.g., FP4), the numerical noise forces the high-precision policy to mimic distorted, low-fidelity semantics. 
Consequently, this naive substitution"},{"citing_arxiv_id":"2603.24936","ref_index":49,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"TIGFlow-GRPO: Trajectory Forecasting via Interaction-Aware Flow Matching and Reward-Guided Optimization","primary_cat":"cs.CV","submitted_at":"2026-03-26T01:59:10+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"TIGFlow-GRPO uses a Trajectory-Interaction-Graph in conditional flow matching plus Flow-GRPO optimization to produce more accurate, socially compliant, and physically feasible trajectory forecasts on ETH/UCY and SDD datasets.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2602.04663","ref_index":22,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Rethinking the Design Space of Reinforcement Learning for Diffusion Models: On the Importance of Likelihood Estimation Beyond Loss Design","primary_cat":"cs.LG","submitted_at":"2026-02-04T15:36:42+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"An ELBO-based likelihood estimator from the final generated sample dominates other RL design factors for diffusion models, raising GenEval from 0.24 to 0.95 in 90 GPU hours with better efficiency than prior methods.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2510.16888","ref_index":22,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Uniworld-V2: Reinforce Image Editing with Diffusion Negative-aware Finetuning and MLLM Implicit Feedback","primary_cat":"cs.CV","submitted_at":"2025-10-19T15:38:06+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"UniWorld-V2 applies policy optimization via DiffusionNFT and MLLM 
logit feedback with group filtering to reach state-of-the-art scores of 4.49 on ImgEdit and 7.83 on GEdit-Bench while remaining model-agnostic.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}