hub Canonical reference

Pref-GRPO: Pairwise Preference Reward-based GRPO for Stable Text-to-Image Reinforcement Learning

· 2025 · cs.CV · arXiv 2508.20751

Canonical reference. 86% of citing Pith papers cite this work as background.

15 Pith papers citing it

Background 86% of classified citations

open full Pith review browse 15 citing papers arXiv PDF

abstract

Recent advancements highlight the importance of GRPO-based reinforcement learning methods and benchmarking in enhancing text-to-image (T2I) generation. However, current methods using pointwise reward models (RM) for scoring generated images are susceptible to reward hacking. We reveal that this happens when minimal score differences between images are amplified after normalization, creating illusory advantages that drive the model to over-optimize for trivial gains, ultimately destabilizing the image generation process. To address this, we propose Pref-GRPO, a pairwise preference reward-based GRPO method that shifts the optimization objective from score maximization to preference fitting, ensuring more stable training. In Pref-GRPO, images are pairwise compared within each group using preference RM, and the win rate is used as the reward signal. Extensive experiments demonstrate that PREF-GRPO differentiates subtle image quality differences, providing more stable advantages and mitigating reward hacking. Additionally, existing T2I benchmarks are limited by coarse evaluation criteria, hindering comprehensive model assessment. To solve this, we introduce UniGenBench, a unified T2I benchmark comprising 600 prompts across 5 main themes and 20 subthemes. It evaluates semantic consistency through 10 primary and 27 sub-criteria, leveraging MLLM for benchmark construction and evaluation. Our benchmarks uncover the strengths and weaknesses of both open and closed-source T2I models and validate the effectiveness of Pref-GRPO.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 7

citation-polarity summary

background 6 support 1

representative citing papers

ETCHR: Editing To Clarify and Harness Reasoning

cs.CV · 2026-05-22 · unverdicted · novelty 7.0

A decoupled question-conditioned image editor trained via supervised imitation then VLM-reward enhancement improves MLLM visual reasoning Pass@1 by 4.6-5.5 points across models and tasks.

Efficient Adjoint Matching for Fine-tuning Diffusion Models

cs.LG · 2026-05-12 · unverdicted · novelty 7.0 · 2 refs

EAM reformulates adjoint matching for diffusion fine-tuning with linear base drift to allow efficient deterministic sampling and closed-form adjoints while matching or exceeding prior performance.

TMPO: Trajectory Matching Policy Optimization for Diverse and Efficient Diffusion Alignment

cs.LG · 2026-05-09 · unverdicted · novelty 7.0 · 2 refs

TMPO uses Softmax Trajectory Balance to match policy probabilities over multiple trajectories to a Boltzmann reward distribution, improving diversity by 9.1% in diffusion alignment tasks.

HP-Edit: A Human-Preference Post-Training Framework for Image Editing

cs.CV · 2026-04-21 · unverdicted · novelty 7.0

HP-Edit introduces a post-training framework and RealPref-50K dataset that uses a VLM-based HP-Scorer to align diffusion image editing models with human preferences, improving outputs on Qwen-Image-Edit-2509.

LeapAlign: Post-Training Flow Matching Models at Any Generation Step by Building Two-Step Trajectories

cs.CV · 2026-04-16 · unverdicted · novelty 7.0

LeapAlign fine-tunes flow matching models by constructing two consecutive leaps that skip multiple ODE steps with randomized timesteps and consistency weighting, enabling stable updates at any generation step.

Power Reinforcement Post-Training of Text-to-Image Models with Super-Linear Advantage Shaping

cs.CV · 2026-05-11 · unverdicted · novelty 6.0

Super-Linear Advantage Shaping (SLAS) introduces a non-linear geometric policy update for RL post-training of text-to-image models that reshapes the local policy space via advantage-dependent Fisher-Rao weighting to reduce reward hacking and improve performance over GRPO baselines.

Leveraging Verifier-Based Reinforcement Learning in Image Editing

cs.CV · 2026-04-30 · unverdicted · novelty 6.0 · 2 refs

Edit-R1 builds a CoT-based reasoning reward model (RRM) via SFT and GCPO, then applies it with GRPO to improve image editing models such as FLUX.1-kontext.

Uniworld-V2: Reinforce Image Editing with Diffusion Negative-aware Finetuning and MLLM Implicit Feedback

cs.CV · 2025-10-19 · unverdicted · novelty 6.0

UniWorld-V2 applies policy optimization via DiffusionNFT and MLLM logit feedback with group filtering to reach state-of-the-art scores of 4.49 on ImgEdit and 7.83 on GEdit-Bench while remaining model-agnostic.

Lens: Rethinking Training Efficiency for Foundational Text-to-Image Models

cs.CV · 2026-05-20 · unverdicted · novelty 5.0

Lens is a 3.8B-parameter text-to-image model that reaches competitive or superior performance to >6B-parameter systems using 19.3% of the training compute of Z-Image through a densely captioned 800M dataset, multi-resolution batching, semantic VAE, strong language encoder, RL fine-tuning, and 4-step

Embedding-perturbed Exploration Preference Optimization for Flow Models

cs.CV · 2026-05-15 · unverdicted · novelty 5.0

E²PO uses embedding-level perturbations to maintain intra-group variance and discriminative signal in RL-based preference optimization for generative flow models.

Self-Improving Tabular Language Models via Iterative Reward-Guided Post-Training

cs.LG · 2026-04-21 · unverdicted · novelty 5.0

TabGRAA applies group-relative advantage alignment in an iterative reward-guided post-training loop to improve tabular language model generators on fidelity, utility, and privacy trade-offs across five benchmarks.

Reward-Aware Trajectory Shaping for Few-step Visual Generation

cs.CV · 2026-04-16 · unverdicted · novelty 5.0

RATS lets few-step visual generators surpass multi-step teachers by shaping trajectories with reward-based adaptive guidance instead of strict imitation.

Reward Hacking in the Era of Large Models: Mechanisms, Emergent Misalignment, Challenges

cs.LG · 2026-04-15 · unverdicted · novelty 5.0

The paper introduces the Proxy Compression Hypothesis as a unifying framework explaining reward hacking in RLHF as an emergent result of compressing high-dimensional human objectives into proxy reward signals under optimization pressure.

PSR: Scaling Multi-Subject Personalized Image Generation with Pairwise Subject-Consistency Rewards

cs.CV · 2025-12-01 · conditional · novelty 5.0

A data-generation pipeline plus pairwise subject-consistency rewards in RL improve consistency and prompt adherence for multi-subject personalized image generation.

Principled RL for Flow Matching Emerges from the Chunk-level Policy Optimization

cs.CV · 2025-10-24 · unverdicted · novelty 5.0

GCPO shifts RL policy optimization for flow matching from step-level to chunk-level grouping of consecutive denoising steps, reporting up to 43% relative gains over GRPO on T2I benchmarks and preference tasks.

citing papers explorer

Showing 15 of 15 citing papers.

ETCHR: Editing To Clarify and Harness Reasoning cs.CV · 2026-05-22 · unverdicted · none · ref 32 · internal anchor
A decoupled question-conditioned image editor trained via supervised imitation then VLM-reward enhancement improves MLLM visual reasoning Pass@1 by 4.6-5.5 points across models and tasks.
Efficient Adjoint Matching for Fine-tuning Diffusion Models cs.LG · 2026-05-12 · unverdicted · none · ref 34 · 2 links · internal anchor
EAM reformulates adjoint matching for diffusion fine-tuning with linear base drift to allow efficient deterministic sampling and closed-form adjoints while matching or exceeding prior performance.
TMPO: Trajectory Matching Policy Optimization for Diverse and Efficient Diffusion Alignment cs.LG · 2026-05-09 · unverdicted · none · ref 20 · 2 links · internal anchor
TMPO uses Softmax Trajectory Balance to match policy probabilities over multiple trajectories to a Boltzmann reward distribution, improving diversity by 9.1% in diffusion alignment tasks.
HP-Edit: A Human-Preference Post-Training Framework for Image Editing cs.CV · 2026-04-21 · unverdicted · none · ref 46 · internal anchor
HP-Edit introduces a post-training framework and RealPref-50K dataset that uses a VLM-based HP-Scorer to align diffusion image editing models with human preferences, improving outputs on Qwen-Image-Edit-2509.
LeapAlign: Post-Training Flow Matching Models at Any Generation Step by Building Two-Step Trajectories cs.CV · 2026-04-16 · unverdicted · none · ref 47 · internal anchor
LeapAlign fine-tunes flow matching models by constructing two consecutive leaps that skip multiple ODE steps with randomized timesteps and consistency weighting, enabling stable updates at any generation step.
Power Reinforcement Post-Training of Text-to-Image Models with Super-Linear Advantage Shaping cs.CV · 2026-05-11 · unverdicted · none · ref 9 · internal anchor
Super-Linear Advantage Shaping (SLAS) introduces a non-linear geometric policy update for RL post-training of text-to-image models that reshapes the local policy space via advantage-dependent Fisher-Rao weighting to reduce reward hacking and improve performance over GRPO baselines.
Leveraging Verifier-Based Reinforcement Learning in Image Editing cs.CV · 2026-04-30 · unverdicted · none · ref 58 · 2 links · internal anchor
Edit-R1 builds a CoT-based reasoning reward model (RRM) via SFT and GCPO, then applies it with GRPO to improve image editing models such as FLUX.1-kontext.
Uniworld-V2: Reinforce Image Editing with Diffusion Negative-aware Finetuning and MLLM Implicit Feedback cs.CV · 2025-10-19 · unverdicted · none · ref 18 · internal anchor
UniWorld-V2 applies policy optimization via DiffusionNFT and MLLM logit feedback with group filtering to reach state-of-the-art scores of 4.49 on ImgEdit and 7.83 on GEdit-Bench while remaining model-agnostic.
Lens: Rethinking Training Efficiency for Foundational Text-to-Image Models cs.CV · 2026-05-20 · unverdicted · none · ref 55 · internal anchor
Lens is a 3.8B-parameter text-to-image model that reaches competitive or superior performance to >6B-parameter systems using 19.3% of the training compute of Z-Image through a densely captioned 800M dataset, multi-resolution batching, semantic VAE, strong language encoder, RL fine-tuning, and 4-step
Embedding-perturbed Exploration Preference Optimization for Flow Models cs.CV · 2026-05-15 · unverdicted · none · ref 80 · internal anchor
E²PO uses embedding-level perturbations to maintain intra-group variance and discriminative signal in RL-based preference optimization for generative flow models.
Self-Improving Tabular Language Models via Iterative Reward-Guided Post-Training cs.LG · 2026-04-21 · unverdicted · none · ref 219 · internal anchor
TabGRAA applies group-relative advantage alignment in an iterative reward-guided post-training loop to improve tabular language model generators on fidelity, utility, and privacy trade-offs across five benchmarks.
Reward-Aware Trajectory Shaping for Few-step Visual Generation cs.CV · 2026-04-16 · unverdicted · none · ref 41 · internal anchor
RATS lets few-step visual generators surpass multi-step teachers by shaping trajectories with reward-based adaptive guidance instead of strict imitation.
Reward Hacking in the Era of Large Models: Mechanisms, Emergent Misalignment, Challenges cs.LG · 2026-04-15 · unverdicted · none · ref 187 · internal anchor
The paper introduces the Proxy Compression Hypothesis as a unifying framework explaining reward hacking in RLHF as an emergent result of compressing high-dimensional human objectives into proxy reward signals under optimization pressure.
PSR: Scaling Multi-Subject Personalized Image Generation with Pairwise Subject-Consistency Rewards cs.CV · 2025-12-01 · conditional · none · ref 36 · internal anchor
A data-generation pipeline plus pairwise subject-consistency rewards in RL improve consistency and prompt adherence for multi-subject personalized image generation.
Principled RL for Flow Matching Emerges from the Chunk-level Policy Optimization cs.CV · 2025-10-24 · unverdicted · none · ref 15 · internal anchor
GCPO shifts RL policy optimization for flow matching from step-level to chunk-level grouping of consecutive denoising steps, reporting up to 43% relative gains over GRPO on T2I benchmarks and preference tasks.

Pref-GRPO: Pairwise Preference Reward-based GRPO for Stable Text-to-Image Reinforcement Learning

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer