MixGRPO: Unlocking Flow-based GRPO Efficiency with Mixed ODE-SDE
26 Pith papers cite this work. Polarity classification is still indexing.
abstract
Although GRPO substantially enhances flow matching models in human preference alignment of image generation, methods such as FlowGRPO and DanceGRPO still exhibit inefficiency due to the necessity of sampling and optimizing over all denoising steps specified by the Markov Decision Process (MDP). In this paper, we propose $\textbf{MixGRPO}$, a novel framework that leverages the flexibility of mixed sampling strategies through the integration of stochastic differential equations (SDE) and ordinary differential equations (ODE). This streamlines the optimization process within the MDP to improve efficiency and boost performance. Specifically, MixGRPO introduces a sliding window mechanism, using SDE sampling and GRPO-guided optimization only within the window, while applying ODE sampling outside. This design confines sampling randomness to the time-steps within the window, thereby reducing the optimization overhead, and allowing for more focused gradient updates to accelerate convergence. Additionally, as time-steps beyond the sliding window are not involved in optimization, higher-order solvers are supported for faster sampling. So we present a faster variant, termed $\textbf{MixGRPO-Flash}$, which further improves training efficiency while achieving comparable performance. MixGRPO exhibits substantial gains across multiple dimensions of human preference alignment, outperforming DanceGRPO in both effectiveness and efficiency, with nearly 50% lower training time. Notably, MixGRPO-Flash further reduces training time by 71%.
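To make the abstract's sliding-window scheme concrete, below is a minimal sketch of a single MixGRPO-style denoising rollout. It assumes hypothetical sde_step and ode_step helpers, a fixed window position, and PyTorch tensors; it illustrates the idea rather than the authors' implementation, and the exact window-shifting schedule and GRPO objective are defined in the paper.

```python
import torch

def mixgrpo_rollout(model, x, timesteps, window_start, window_size, sde_step, ode_step):
    """Sketch of one mixed ODE-SDE denoising rollout (illustrative only)."""
    log_probs = []  # per-step log-probabilities, collected only inside the sliding window
    for i, t in enumerate(timesteps):
        in_window = window_start <= i < window_start + window_size
        if in_window:
            # Stochastic SDE step: its randomness is what GRPO explores and optimizes.
            x, logp = sde_step(model, x, t)
            log_probs.append(logp)
        else:
            # Deterministic ODE step: no sampling randomness, so no gradient updates needed;
            # a higher-order solver can be substituted here (as in MixGRPO-Flash).
            with torch.no_grad():
                x = ode_step(model, x, t)
    return x, log_probs
```

In this sketch only the steps inside the window carry sampling randomness and log-probabilities, so the GRPO loss and its gradients touch just those steps; per the abstract, the window slides over the denoising timesteps during training, and the ODE steps outside it may use a higher-order solver, which is what MixGRPO-Flash exploits for further speedups.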
hub tools
citation-role summary
roles: background (1)
citation-polarity summary
polarities: unclear (1)
citing papers explorer
- OP-GRPO: Efficient Off-Policy GRPO for Flow-Matching Models
OP-GRPO is the first off-policy GRPO method for flow-matching models that reuses trajectories via replay buffer and importance sampling corrections, matching on-policy performance with 34.2% of the training steps.
- CreFlow: Corrective Reflow for Sparse-Reward Embodied Video Diffusion RL
CreFlow combines LTL compositional rewards with credit-aware NFT and corrective reflow losses in online RL to improve embodied video diffusion models, raising downstream task success by 23.8 percentage points on eight bimanual manipulation tasks.
- OmniNFT: Modality-wise Omni Diffusion Reinforcement for Joint Audio-Video Generation
OmniNFT introduces modality-wise advantage routing, layer-wise gradient surgery, and region-wise loss reweighting in an online diffusion RL framework to improve audio-video quality, alignment, and synchronization.
- TMPO: Trajectory Matching Policy Optimization for Diverse and Efficient Diffusion Alignment
TMPO uses Softmax Trajectory Balance to match policy probabilities over multiple trajectories to a Boltzmann reward distribution, improving diversity by 9.1% in diffusion alignment tasks.
- Flow-OPD: On-Policy Distillation for Flow Matching Models
Flow-OPD applies on-policy distillation to flow matching models via specialized teachers, cold-start initialization, and manifold anchor regularization, lifting GenEval from 63 to 92 and OCR from 59 to 94 on Stable Diffusion 3.5 Medium.
- Improved techniques for fine-tuning flow models via adjoint matching: a deterministic control pipeline
A new adjoint matching framework formulates flow model alignment as optimal control, enabling direct regression training and terminal-trajectory truncation for efficiency gains on models like SiT-XL and FLUX.
- ParetoSlider: Diffusion Models Post-Training for Continuous Reward Control
ParetoSlider conditions diffusion models on continuous preference weights to approximate the full Pareto front, providing dynamic control over multi-objective rewards at inference time.
- Learning to Credit the Right Steps: Objective-aware Process Optimization for Visual Generation
OTCA improves GRPO training for visual generation by estimating step importance in trajectories and adaptively weighting multiple reward objectives.
- Guiding Distribution Matching Distillation with Gradient-Based Reinforcement Learning
GDMD replaces raw-sample rewards with distillation-gradient rewards in RL-guided diffusion distillation, yielding 4-step models that surpass their multi-step teachers on GenEval and human preference metrics.
- UDM-GRPO: Stable and Efficient Group Relative Policy Optimization for Uniform Discrete Diffusion Models
UDM-GRPO is the first RL integration for uniform discrete diffusion models, using final clean samples as actions and forward-process trajectory reconstruction to raise GenEval accuracy from 69% to 96% and OCR accuracy from 8% to 57%.
- LeapAlign: Post-Training Flow Matching Models at Any Generation Step by Building Two-Step Trajectories
LeapAlign fine-tunes flow matching models by constructing two consecutive leaps that skip multiple ODE steps with randomized timesteps and consistency weighting, enabling stable updates at any generation step.
- YingMusic-Singer-Plus: Controllable Singing Voice Synthesis with Flexible Lyric Manipulation and Annotation-free Melody Guidance
YingMusic-Singer-Plus is a diffusion model for singing voice synthesis that preserves melody from a reference clip while allowing flexible lyric changes without manual alignment, outperforming Vevo2 and introducing the LyricEditBench benchmark.
- When Policy Entropy Constraint Fails: Preserving Diversity in Flow-based RLHF via Perceptual Entropy
Policy entropy remains constant in flow-matching models during RLHF due to fixed noise schedules, while perceptual diversity collapses under mode-seeking policy gradients; perceptual entropy constraints are therefore introduced to preserve diversity and improve quality.
- Power Reinforcement Post-Training of Text-to-Image Models with Super-Linear Advantage Shaping
Super-Linear Advantage Shaping (SLAS) introduces a non-linear geometric policy update for RL post-training of text-to-image models that reshapes the local policy space via advantage-dependent Fisher-Rao weighting to reduce reward hacking and improve performance over GRPO baselines.
- From Synthetic to Real: Toward Identity-Consistent Makeup Transfer with Synthetic and Real Data
The work creates identity-consistent synthetic makeup data via ConsistentBeauty and adapts models to real images using reinforcement learning in RealBeauty, achieving better identity preservation and real-world performance than prior methods.
- POCA: Pareto-Optimal Curriculum Alignment for Visual Text Generation
POCA combines Pareto optimization with curriculum alignment to improve multi-reward reinforcement learning for visual text generation without relying on weighted sums.
- V-GRPO: Online Reinforcement Learning for Denoising Generative Models Is Easier than You Think
V-GRPO makes ELBO surrogates stable and efficient for online RL alignment of denoising models, delivering SOTA text-to-image performance with 2-3x speedups over MixGRPO and DiffusionNFT.
- Reward Score Matching: Unifying Reward-based Fine-tuning for Flow and Diffusion Models
Reward Score Matching unifies reward-based fine-tuning for flow and diffusion models by recasting alignment as score matching to a value-guided target.
- Region-Constrained Group Relative Policy Optimization for Flow-Based Image Editing
RC-GRPO-Editing constrains GRPO exploration to editing regions via localized noise and attention rewards, improving instruction adherence and non-target preservation in flow-based image editing.
- MAR-GRPO: Stabilized GRPO for AR-diffusion Hybrid Image Generation
MAR-GRPO stabilizes GRPO for AR-diffusion hybrids via multi-trajectory expectation and uncertainty-based token selection, yielding better visual quality, stability, and spatial understanding than baselines.
- FP4 Explore, BF16 Train: Diffusion Reinforcement Learning via Efficient Rollout Scaling
Sol-RL decouples FP4-based candidate exploration from BF16 policy optimization in diffusion RL, delivering up to 4.64x faster convergence with maintained or superior alignment performance on models like FLUX.1 and SD3.5.
- CellFluxRL: Biologically-Constrained Virtual Cell Modeling via Reinforcement Learning
CellFluxRL post-trains the CellFlux generative model with reinforcement learning driven by biologically meaningful reward functions, yielding virtual cell images that better satisfy physical and biological constraints than the base model.
- HunyuanVideo 1.5 Technical Report
HunyuanVideo 1.5 delivers state-of-the-art open-source text-to-video and image-to-video generation with an 8.3B parameter DiT model featuring SSTA attention, glyph-aware encoding, and progressive training.
- Diffusion-APO: Trajectory-Aware Direct Preference Alignment for Video Diffusion Transformers
Diffusion-APO synchronizes training noise with inference trajectories in video diffusion models to improve preference alignment and visual quality.
- A Systematic Post-Train Framework for Video Generation
A post-training pipeline for video generation models combines SFT, RLHF with a novel GRPO variant, prompt enhancement, and inference optimization to improve visual quality, temporal coherence, and instruction following.
- Reward-Aware Trajectory Shaping for Few-step Visual Generation
RATS lets few-step visual generators surpass multi-step teachers by shaping trajectories with reward-based adaptive guidance instead of strict imitation.