{"total":11,"items":[{"citing_arxiv_id":"2605.23522","ref_index":15,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Precise: SDE-Consistent Stochastic Sampling for RL Post-Training of Flow-Matching Models","primary_cat":"cs.LG","submitted_at":"2026-05-22T11:37:22+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Precise is a new SDE-consistent stochastic sampler that balances exploration and stability for RL post-training of flow-matching models via a novel posterior-mean approximation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.15980","ref_index":27,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Flash-GRPO: Efficient Alignment for Video Diffusion via One-Step Policy Optimization","primary_cat":"cs.CV","submitted_at":"2026-05-15T14:13:39+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Flash-GRPO introduces iso-temporal grouping and temporal gradient rectification to enable single-step GRPO training that outperforms full-trajectory methods on video diffusion alignment under low compute budgets.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.12112","ref_index":60,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"When Policy Entropy Constraint Fails: Preserving Diversity in Flow-based RLHF via Perceptual Entropy","primary_cat":"cs.CV","submitted_at":"2026-05-12T13:29:10+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Policy entropy remains constant in flow-matching models during RLHF due to fixed noise schedules while perceptual diversity collapses from mode-seeking policy gradients, so perceptual entropy constraints are 
introduced to preserve diversity and improve quality.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"RLHF for Flow Matching. RLHF aligns generative models with human preferences via reward signals [39]. Building on early progress for diffusion models [6, 19, 59, 75], recent work extends RLHF to flow matching by enabling stochastic rollouts and GRPO-style optimization [40, 73]. Follow-up studies improve efficiency [21, 26, 35], preference modeling [78], theory [60], and reward hacking [61]. Despite these advances, RLHF for flow models often suffers from diversity collapse, and its underlying mechanism remains unclear. Entropy in LLM Reinforcement Finetuning. Entropy is widely used to characterize exploration and predict downstream gains in RLVR [12, 25, 52]. Token-level analyses further show that high-entropy"},{"citing_arxiv_id":"2605.10983","ref_index":48,"ref_count":2,"confidence":0.9,"is_internal_anchor":false,"paper_title":"TMPO: Trajectory Matching Policy Optimization for Diverse and Efficient Diffusion Alignment","primary_cat":"cs.LG","submitted_at":"2026-05-09T04:41:02+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"TMPO uses Softmax Trajectory Balance to match policy probabilities over multiple trajectories to a Boltzmann reward distribution, improving diversity by 9.1% in diffusion alignment tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.07503","ref_index":41,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Diffusion-APO: Trajectory-Aware Direct Preference Alignment for Video Diffusion Transformers","primary_cat":"cs.CV","submitted_at":"2026-05-08T09:37:46+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Diffusion-APO synchronizes training noise 
with inference trajectories in video diffusion models to improve preference alignment and visual quality.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Conference on Computer Vision and Pattern Recognition, pages 8228-8238, 2024. [39] T. Wan, A. Wang, B. Ai, B. Wen, C. Mao, C.-W. Xie, D. Chen, F. Yu, H. Zhao, J. Yang, et al. Wan: Open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314, 2025. [40] F. Wang and Z. Yu. Coefficients-preserving sampling for reinforcement learning with flow matching. arXiv preprint arXiv:2509.05952, 2025. [41] J. Wang, H. Yuan, D. Chen, Y. Zhang, X. Wang, and S. Zhang. Modelscope text-to-video technical report. arXiv preprint arXiv:2308.06571, 2023. [42] Y. Wang, X. Chen, X. Ma, S. Zhou, Z. Huang, Y. Wang, C. Yang, Y. He, J. Yu, P. Yang, et al. Lavie: High-quality video generation with cascaded latent diffusion models. International Journal of Computer"},{"citing_arxiv_id":"2604.25427","ref_index":24,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"A Systematic Post-Train Framework for Video Generation","primary_cat":"cs.CV","submitted_at":"2026-04-28T09:34:51+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"A post-training pipeline for video generation models combines SFT, RLHF with novel GRPO, prompt enhancement, and inference optimization to improve visual quality, temporal coherence, and instruction following.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Flow-GRPO [22] and DanceGRPO [13] incorporate GRPO-style policy optimization into flow-matching frameworks by reformulating deterministic ODE sampling as stochastic SDE processes, thereby introducing exploratory noise for group-based policy improvement. 
More recently, MixGRPO [23] introduced a hybrid ODE-SDE sampling strategy that enhances training efficiency without compromising generative quality. Concurrently, Flow-CPS [24] identified a critical limitation in the SDE sampling employed by Flow-GRPO and DanceGRPO, the inconsistent noise coefficients across timesteps, which results in residual noise accumulation and imprecise reward estimation. To mitigate this, Flow-CPS proposes a noise-consistent SDE sampling method that improves reward accuracy and accelerates GRPO convergence."},{"citing_arxiv_id":"2604.23380","ref_index":39,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"V-GRPO: Online Reinforcement Learning for Denoising Generative Models Is Easier than You Think","primary_cat":"cs.LG","submitted_at":"2026-04-25T17:03:21+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"V-GRPO makes ELBO surrogates stable and efficient for online RL alignment of denoising models, delivering SOTA text-to-image performance with 2-3x speedups over MixGRPO and DiffusionNFT.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"[37] Yang Song, Conor Durkan, Iain Murray, and Stefano Ermon. Maximum likelihood training of score-based diffusion models. NeurIPS, 2021. 2, 3 [38] Bram Wallace, Meihua Dang, Rafael Rafailov, Linqi Zhou, Aaron Lou, Senthil Purushwalkam, Stefano Ermon, Caiming Xiong, Shafiq Joty, and Nikhil Naik. Diffusion model alignment using direct preference optimization. CVPR, 2024. 2, 5 [39] Feng Wang and Zihao Yu. Coefficients-preserving sampling for reinforcement learning with flow matching. arXiv:2509.05952, 2025. 2 [40] Jing Wang, Jiajun Liang, Jie Liu, Henglin Liu, Gongye Liu, Jun Zheng, Wanyuan Pang, Ao Ma, Zhenyu Xie, Xintao Wang, et al. 
Grpo-guard: Mitigating implicit over-optimization in flow matching via regulated clipping."},{"citing_arxiv_id":"2604.10962","ref_index":22,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"ScoRe-Flow: Complete Distributional Control via Score-Based Reinforcement Learning for Flow Matching","primary_cat":"cs.RO","submitted_at":"2026-04-13T03:56:37+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"ScoRe-Flow achieves decoupled mean-variance control in stochastic flow matching by deriving a closed-form score for drift modulation plus learned variance, yielding faster RL convergence and higher success rates on locomotion and manipulation benchmarks.","context_count":1,"top_context_role":"background","top_context_polarity":"unclear","context_text":"ScoRe-Flow: Complete Distributional Control for Flow Matching Step 4: Practical estimator via a learned FM velocity field. Flow Matching learns a marginal vector field, which (under the standard conditional-to-marginal construction) corresponds to the posterior average of per-sample conditional velocities. Concretely, for the linear path, the optimal marginal velocity satisfies $v_{\\mathrm{marg}}(t, a_t, s) = \\mathbb{E}[v^\\star \\mid a_t, s]$. (22) Thus, with a learned FM velocity field $v_\\theta(t, a_t, s) \\approx v_{\\mathrm{marg}}(t, a_t, s)$, we obtain the score estimator used in the main text: $s_t(a_t) \\approx \\frac{t\\,v_\\theta(t, a_t, s) - a_t}{1 - t}$. (23) Asymptotic behavior near $t \\to 1$ and stabilization. The prefactor $(1-t)^{-1}$ in (23) implies $\\|s_t(a_t)\\| = O((1-t)^{-1})$ as $t \\to 1$ (for bounded $v_\\theta$). 
Therefore, the coefficient multiplying the score in the drift should decay as $O(1-t)$ to keep the"},{"citing_arxiv_id":"2510.21583","ref_index":14,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Principled RL for Flow Matching Emerges from the Chunk-level Policy Optimization","primary_cat":"cs.CV","submitted_at":"2025-10-24T15:50:36+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"GCPO shifts RL policy optimization for flow matching from step-level to chunk-level grouping of consecutive denoising steps, reporting up to 43% relative gains over GRPO on T2I benchmarks and preference tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2509.16117","ref_index":21,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"DiffusionNFT: Online Diffusion Reinforcement with Forward Process","primary_cat":"cs.LG","submitted_at":"2025-09-19T16:09:33+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"DiffusionNFT performs online RL for diffusion models on the forward process via flow matching and positive-negative contrasts, delivering up to 25x efficiency gains and rapid benchmark improvements over prior reverse-process methods.","context_count":1,"top_context_role":"background","top_context_polarity":"unclear","context_text":"${}_s - (l(s,t)\\,x_t - m(s,t)\\,v_\\theta(x_t,t))\\|_2^2 / (2n^2(s,t)) + C = -\\|m(s,t)\\,v_\\theta(x_t,t) - m(s,t)\\,v_{\\mathrm{sg}(\\theta)}(x_t,t) + n(s,t)\\,\\epsilon^{(i)}\\|_2^2 / (2n^2(s,t)) + C$ (20) sg emerges because the samples $x_s^{(1)}, \\ldots, x_s^{(N)}$ are gradient-free. 
The gradient of the reverse-step log likelihood w.r.t. $\\theta$ can be surprisingly reduced to a simple form: $\\nabla_\\theta \\log p_\\theta(x_s^{(i)} \\mid x_t) = -\\frac{m(s,t)}{n(s,t)} \\nabla_\\theta\\big((\\epsilon^{(i)})^\\top v_\\theta(x_t,t)\\big)$ (21) and $\\nabla_\\theta L(\\theta) = \\frac{m(s,t)}{n(s,t)} \\nabla_\\theta \\Big[ \\frac{1}{N} \\sum_{i=1}^{N} (A^{(i)} \\epsilon^{(i)})^\\top v_\\theta(x_t,t) \\Big]$ (22) Therefore, FlowGRPO essentially aligns the velocity field with the advantage-weighted noise, while the choice of timesteps and sampler only influences the weighting $\\frac{m(s,t)}{n(s,t)}$ across sampling steps. In the following, we show a further conclusion that FlowGRPO can be viewed as a gradient estimation"},{"citing_arxiv_id":"2507.21802","ref_index":40,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"MixGRPO: Unlocking Flow-based GRPO Efficiency with Mixed ODE-SDE","primary_cat":"cs.AI","submitted_at":"2025-07-29T13:40:09+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"MixGRPO speeds up GRPO for flow-based image generators by restricting SDE sampling and optimization to a sliding window while using ODE elsewhere, cutting training time by up to 71% with better alignment performance.","context_count":1,"top_context_role":"method","top_context_polarity":"use_method","context_text":"method that optimizes a strategic subset of denoising steps to drastically reduce computational overhead while ensuring a more focused and efficient optimization. Specifically, we employ a mixed ODE-SDE strategy, applying SDE sampling to a denoising sub-interval and Ordinary Differential Equations (ODE) sampling to the rest. Recently, Coefficients-Preserving Sampling (CPS) [40] was proposed as a more principled alternative to standard SDE sampling. [arXiv:2507.21802v6 [cs.AI] 20 Mar 2026] [Figure 1. Comparison of our MixGRPO and DanceGRPO with varying denoising steps to be optimized. MixGRPO achieves higher performance with lower overhead. Panels: DanceGRPO, MixGRPO (Ours). Prompt: Three cows eating in a field with sea in background.]"}],"limit":50,"offset":0}