OP-GRPO is the first off-policy GRPO method for flow-matching models that reuses trajectories via replay buffer and importance sampling corrections, matching on-policy performance with 34.2% of the training steps.
hub Canonical reference
Grpo-guard: Mitigating implicit over-optimization in flow matching via regulated clipping
Canonical reference. 100% of citing Pith papers cite this work as background.
hub tools
citation-role summary
citation-polarity summary
years
2026 15roles
background 6polarities
background 6representative citing papers
DRM turns a pre-trained diffusion model into a step-wise reward model and uses it for dense RL training (Step-wise GRPO) and guided sampling to improve final image quality.
CreFlow combines LTL compositional rewards with credit-aware NFT and corrective reflow losses in online RL to improve embodied video diffusion models, raising downstream task success by 23.8 percentage points on eight bimanual manipulation tasks.
EAM reformulates adjoint matching for diffusion fine-tuning with linear base drift to allow efficient deterministic sampling and closed-form adjoints while matching or exceeding prior performance.
TMPO uses Softmax Trajectory Balance to match policy probabilities over multiple trajectories to a Boltzmann reward distribution, improving diversity by 9.1% in diffusion alignment tasks.
GDMD replaces raw-sample rewards with distillation-gradient rewards in RL-guided diffusion distillation, yielding 4-step models that surpass their multi-step teachers on GenEval and human preference metrics.
DiNa-LRM introduces a diffusion-native latent reward model using a noise-calibrated Thurstone likelihood on noisy states, matching VLM performance at lower compute in image alignment and preference optimization.
Policy entropy remains constant in flow-matching models during RLHF due to fixed noise schedules while perceptual diversity collapses from mode-seeking policy gradients, so perceptual entropy constraints are introduced to preserve diversity and improve quality.
Super-Linear Advantage Shaping (SLAS) introduces a non-linear geometric policy update for RL post-training of text-to-image models that reshapes the local policy space via advantage-dependent Fisher-Rao weighting to reduce reward hacking and improve performance over GRPO baselines.
V-GRPO makes ELBO surrogates stable and efficient for online RL alignment of denoising models, delivering SOTA text-to-image performance with 2-3x speedups over MixGRPO and DiffusionNFT.
An ELBO-based likelihood estimator from the final generated sample dominates other RL design factors for diffusion models, raising GenEval from 0.24 to 0.95 in 90 GPU hours with better efficiency than prior methods.
Precise is a new SDE-consistent stochastic sampler that balances exploration and stability for RL post-training of flow-matching models via a novel posterior-mean approximation.
E²PO uses embedding-level perturbations to maintain intra-group variance and discriminative signal in RL-based preference optimization for generative flow models.
Qwen-Image-2.0 unifies high-fidelity image generation and precise editing by coupling Qwen3-VL with a Multimodal Diffusion Transformer, improving text rendering, photorealism, and complex prompt following over prior versions.
citing papers explorer
-
CreFlow: Corrective Reflow for Sparse-Reward Embodied Video Diffusion RL
CreFlow combines LTL compositional rewards with credit-aware NFT and corrective reflow losses in online RL to improve embodied video diffusion models, raising downstream task success by 23.8 percentage points on eight bimanual manipulation tasks.
-
Rethinking the Design Space of Reinforcement Learning for Diffusion Models: On the Importance of Likelihood Estimation Beyond Loss Design
An ELBO-based likelihood estimator from the final generated sample dominates other RL design factors for diffusion models, raising GenEval from 0.24 to 0.95 in 90 GPU hours with better efficiency than prior methods.