Flow-GRPO is the first online RL method for flow matching models, raising GenEval accuracy from 63% to 95% and text-rendering accuracy from 59% to 92% with little reward hacking.
hub
arXiv:2310.03739, 2023
11 Pith papers cite this work. Polarity classification is still indexing.
hub tools
citation-role summary
citation-polarity summary
representative citing papers
RankE co-evolves AR policy and decoder via alternating ranking optimization, improving both FID and CLIP scores on LlamaGen-XL and Janus-Pro where policy-only RL degrades FID.
EAM reformulates adjoint matching for diffusion fine-tuning with linear base drift to allow efficient deterministic sampling and closed-form adjoints while matching or exceeding prior performance.
PAMELA provides a multi-user rating dataset and personalized reward model that predicts individual image preferences more accurately than prior population-level aesthetic models.
DiNa-LRM introduces a diffusion-native latent reward model using a noise-calibrated Thurstone likelihood on noisy states, matching VLM performance at lower compute in image alignment and preference optimization.
Derives RAM, a reward-adjusted consistency loss extending diffusion pretraining regression to efficient KL-regularized RL post-training, achieving peak rewards up to 50x faster than Flow-GRPO on Stable Diffusion 3.5M.
Adjoint matching objectives derived from the Stochastic Maximum Principle have critical points satisfying HJB stationarity conditions for SOC problems with control-dependent drift and diffusion.
A human preference dataset and VideoReward model enable Flow-DPO and Flow-NRG to produce smoother, better-aligned videos from text prompts in flow-based generators.
Visual generation models are evolving from passive renderers to interactive agentic world modelers, but current systems lack spatial reasoning, temporal consistency, and causal understanding, with evaluations overemphasizing perceptual quality.
A data-generation pipeline plus pairwise subject-consistency rewards in RL improve consistency and prompt adherence for multi-subject personalized image generation.
A literature survey that organizes diffusion model alignment methods along five axes (feedback source, reward form, optimization mechanism, distribution shift handling, and explicit safety constraints) and identifies open challenges for reliable deployment.
citing papers explorer
-
Flow-GRPO: Training Flow Matching Models via Online RL
Flow-GRPO is the first online RL method for flow matching models, raising GenEval accuracy from 63% to 95% and text-rendering accuracy from 59% to 92% with little reward hacking.
-
RankE: End-to-End Post-Training for Discrete Text-to-Image Generation with Decoder Co-Evolution
RankE co-evolves AR policy and decoder via alternating ranking optimization, improving both FID and CLIP scores on LlamaGen-XL and Janus-Pro where policy-only RL degrades FID.
-
Efficient Adjoint Matching for Fine-tuning Diffusion Models
EAM reformulates adjoint matching for diffusion fine-tuning with linear base drift to allow efficient deterministic sampling and closed-form adjoints while matching or exceeding prior performance.
-
Personalizing Text-to-Image Generation to Individual Taste
PAMELA provides a multi-user rating dataset and personalized reward model that predicts individual image preferences more accurately than prior population-level aesthetic models.
-
Beyond VLM-Based Rewards: Diffusion-Native Latent Reward Modeling
DiNa-LRM introduces a diffusion-native latent reward model using a noise-calibrated Thurstone likelihood on noisy states, matching VLM performance at lower compute in image alignment and preference optimization.
-
Reinforce Adjoint Matching: Scaling RL Post-Training of Diffusion and Flow-Matching Models
Derives RAM, a reward-adjusted consistency loss extending diffusion pretraining regression to efficient KL-regularized RL post-training, achieving peak rewards up to 50x faster than Flow-GRPO on Stable Diffusion 3.5M.
-
Adjoint Matching through the Lens of the Stochastic Maximum Principle in Optimal Control
Adjoint matching objectives derived from the Stochastic Maximum Principle have critical points satisfying HJB stationarity conditions for SOC problems with control-dependent drift and diffusion.
-
Improving Video Generation with Human Feedback
A human preference dataset and VideoReward model enable Flow-DPO and Flow-NRG to produce smoother, better-aligned videos from text prompts in flow-based generators.
-
Visual Generation in the New Era: An Evolution from Atomic Mapping to Agentic World Modeling
Visual generation models are evolving from passive renderers to interactive agentic world modelers, but current systems lack spatial reasoning, temporal consistency, and causal understanding, with evaluations overemphasizing perceptual quality.
-
PSR: Scaling Multi-Subject Personalized Image Generation with Pairwise Subject-Consistency Rewards
A data-generation pipeline plus pairwise subject-consistency rewards in RL improve consistency and prompt adherence for multi-subject personalized image generation.
-
Alignment and Safety of Diffusion Models via Reinforcement Learning and Reward Modeling: A Survey
A literature survey that organizes diffusion model alignment methods along five axes (feedback source, reward form, optimization mechanism, distribution shift handling, and explicit safety constraints) and identifies open challenges for reliable deployment.