Flow-GRPO is the first online RL method for flow matching models, raising GenEval accuracy from 63% to 95% and text-rendering accuracy from 59% to 92% with little reward hacking.
Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter
6 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
roles
background 2polarities
background 2representative citing papers
Introduces a state-aligned latent actor-critic framework that lets diffusion models act as their own timestep-conditioned value functions for trajectory-level RL post-training and inference steering.
LeapAlign fine-tunes flow matching models by constructing two consecutive leaps that skip multiple ODE steps with randomized timesteps and consistency weighting, enabling stable updates at any generation step.
SOLACE improves text-to-image generation by using intrinsic self-confidence rewards from noise reconstruction accuracy during reinforcement learning post-training without external supervision.
E²PO uses embedding-level perturbations to maintain intra-group variance and discriminative signal in RL-based preference optimization for generative flow models.
A literature survey that organizes diffusion model alignment methods along five axes (feedback source, reward form, optimization mechanism, distribution shift handling, and explicit safety constraints) and identifies open challenges for reliable deployment.
citing papers explorer
-
Flow-GRPO: Training Flow Matching Models via Online RL
Flow-GRPO is the first online RL method for flow matching models, raising GenEval accuracy from 63% to 95% and text-rendering accuracy from 59% to 92% with little reward hacking.
-
Explicit Critic Guidance for Aligning Diffusion Models
Introduces a state-aligned latent actor-critic framework that lets diffusion models act as their own timestep-conditioned value functions for trajectory-level RL post-training and inference steering.
-
LeapAlign: Post-Training Flow Matching Models at Any Generation Step by Building Two-Step Trajectories
LeapAlign fine-tunes flow matching models by constructing two consecutive leaps that skip multiple ODE steps with randomized timesteps and consistency weighting, enabling stable updates at any generation step.
-
Improving Text-to-Image Generation with Intrinsic Self-Confidence Rewards
SOLACE improves text-to-image generation by using intrinsic self-confidence rewards from noise reconstruction accuracy during reinforcement learning post-training without external supervision.
-
Embedding-perturbed Exploration Preference Optimization for Flow Models
E²PO uses embedding-level perturbations to maintain intra-group variance and discriminative signal in RL-based preference optimization for generative flow models.