Pref-GRPO: Pairwise Preference Reward-based GRPO for Stable Text-to-Image Reinforcement Learning
Pith reviewed 2026-05-18 20:27 UTC · model grok-4.3
The pith
By using win rates from pairwise comparisons, Pref-GRPO stabilizes reinforcement learning for text-to-image generation and reduces reward hacking.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that shifting the optimization objective from score maximization to preference fitting yields more stable advantages and mitigates reward hacking in text-to-image reinforcement learning. Each image's reward is its win rate in pairwise comparisons against the other images in its group, as judged by a preference reward model.
What carries the argument
Pairwise preference reward-based GRPO, in which each image's win rate from intra-group pairwise comparisons serves as the reward signal in place of a pointwise reward-model score.
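A minimal sketch of the win-rate reward described above, assuming a hypothetical `prefers(a, b)` callable that stands in for the preference reward model and returns True when the first item is judged better. Each unordered pair is queried once and ties are not modeled; both are simplifications for illustration, not the paper's implementation.

```python
from itertools import combinations
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def win_rate_rewards(group: Sequence[T], prefers: Callable[[T, T], bool]) -> List[float]:
    """Return each item's win rate against the other items in its group.

    `prefers(a, b)` is a hypothetical stand-in for a preference reward model:
    it returns True when `a` is judged better than `b`.
    """
    n = len(group)
    if n < 2:
        raise ValueError("a group needs at least two items to compare")
    wins = [0] * n
    # Query every unordered pair once and credit the winner.
    for i, j in combinations(range(n), 2):
        if prefers(group[i], group[j]):
            wins[i] += 1
        else:
            wins[j] += 1
    # Each item takes part in n - 1 comparisons.
    return [w / (n - 1) for w in wins]


# Toy stand-in: the "images" are numbers and the larger one is preferred.
toy_group = [0.90, 0.91, 0.89, 0.92]
rewards = win_rate_rewards(toy_group, lambda a, b: a > b)
```

In this toy group the rewards are the evenly spaced values 1/3, 2/3, 0 and 1, regardless of how close the underlying numbers are, which is the property the argument relies on.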
If this is right
- Training becomes more stable by avoiding the amplification of minimal score differences.
- The method better captures subtle differences in image quality.
- Reward hacking is reduced in GRPO-based approaches for T2I.
- UniGenBench enables more comprehensive evaluation of T2I models across multiple themes and criteria.
Where Pith is reading between the lines
- Similar pairwise preference approaches could improve stability in other areas of reinforcement learning from feedback.
- The new benchmark may highlight specific weaknesses in current T2I models that pointwise evaluations miss.
- Integrating preference fitting with other optimization techniques could lead to even better image generation results.
Load-bearing premise
That the preference reward models give accurate signals for subtle quality differences and that the main cause of prior instability is the normalization amplifying tiny score gaps.
What would settle it
A demonstration that Pref-GRPO exhibits reward hacking or instability when applied to certain preference models, or that the win rates fail to predict human preferences on image pairs.
Original abstract
Recent advancements highlight the importance of GRPO-based reinforcement learning methods and benchmarking in enhancing text-to-image (T2I) generation. However, current methods using pointwise reward models (RM) for scoring generated images are susceptible to reward hacking. We reveal that this happens when minimal score differences between images are amplified after normalization, creating illusory advantages that drive the model to over-optimize for trivial gains, ultimately destabilizing the image generation process. To address this, we propose Pref-GRPO, a pairwise preference reward-based GRPO method that shifts the optimization objective from score maximization to preference fitting, ensuring more stable training. In Pref-GRPO, images are pairwise compared within each group using preference RM, and the win rate is used as the reward signal. Extensive experiments demonstrate that PREF-GRPO differentiates subtle image quality differences, providing more stable advantages and mitigating reward hacking. Additionally, existing T2I benchmarks are limited by coarse evaluation criteria, hindering comprehensive model assessment. To solve this, we introduce UniGenBench, a unified T2I benchmark comprising 600 prompts across 5 main themes and 20 subthemes. It evaluates semantic consistency through 10 primary and 27 sub-criteria, leveraging MLLM for benchmark construction and evaluation. Our benchmarks uncover the strengths and weaknesses of both open and closed-source T2I models and validate the effectiveness of Pref-GRPO.
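The amplification effect described in the abstract can be illustrated with a small sketch. It assumes the usual GRPO-style group standardization, (r - mean) / (std + eps); the `eps` value, the use of the population standard deviation, and all scores below are illustrative assumptions rather than values from the paper.

```python
from statistics import mean, pstdev
from typing import List


def group_normalize(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Standardize rewards within one group: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Made-up pointwise scores: one group of near-identical images,
# one group with clearly different quality.
close_scores = [0.900, 0.901, 0.899, 0.902]
spread_scores = [0.10, 0.60, 0.30, 0.90]

close_adv = group_normalize(close_scores)
spread_adv = group_normalize(spread_scores)

# Raw gap between best and worst score in the near-identical group.
score_gap = max(close_scores) - min(close_scores)
# Advantage gaps after normalization for both groups.
adv_gap = max(close_adv) - min(close_adv)
spread_gap = max(spread_adv) - min(spread_adv)
```

Here a raw score gap of about 0.003 produces almost the same advantage spread (roughly 2.6 to 2.7) as a raw gap of 0.8, so nearly indistinguishable images receive strongly opposed updates.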
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that pointwise reward model scores in GRPO for text-to-image generation lead to reward hacking because normalization amplifies minimal differences into unstable advantages. Pref-GRPO addresses this by using a preference reward model to perform pairwise comparisons within groups of images and deriving win rates as the reward signal, shifting the objective to preference fitting. The work also introduces UniGenBench, a benchmark with 600 prompts across 5 themes and 20 subthemes, using MLLM-based evaluation on 10 primary and 27 sub-criteria for semantic consistency. Experiments are said to show more stable training and better differentiation of subtle quality differences.
Significance. If the central empirical claims hold, Pref-GRPO would represent a targeted improvement to GRPO-style RL for T2I models by replacing a known source of advantage instability with pairwise preference signals. UniGenBench could fill a gap in coarse existing T2I evaluation by providing finer-grained, multi-criteria assessment. The approach builds directly on cited GRPO and preference RM components without introducing new free parameters in the core formulation.
major comments (3)
- [§3] §3 (Method): The claim that pairwise win rates from the preference RM yield stable advantages rests on the unverified assumption that the RM supplies unbiased signals for subtle differences; no inter-rater agreement, cycle detection, or adversarial robustness results for the RM are reported, which is load-bearing for the reward-hacking mitigation argument.
- [§4.2] §4.2 (Experiments): The reported stability improvements lack controls or ablations isolating the effect of win-rate normalization versus other factors such as policy gradient variance or prompt distribution; without these, attribution of gains to the pairwise formulation is not fully secured.
- [Table 3] Table 3 or equivalent results table: No standard deviations, number of random seeds, or statistical significance tests are provided for the metric gains over GRPO baselines, weakening the cross-method stability claim.
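One of the checks named in the first major comment, cycle detection, could be sketched as follows. The helper looks for intransitive triples (a beats b, b beats c, c beats a) in a set of pairwise judgments. `prefers` is a hypothetical stand-in for the preference RM, and querying each pair in a single order is a simplifying assumption; a real model may also be sensitive to presentation order.

```python
from itertools import combinations
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def find_preference_cycles(
    items: Sequence[T], prefers: Callable[[T, T], bool]
) -> List[Tuple[int, int, int]]:
    """Return index triples (a, b, c) where a beats b, b beats c, and c beats a."""
    n = len(items)
    beats = [[False] * n for _ in range(n)]
    # Build the pairwise outcome table, one query per unordered pair.
    for i, j in combinations(range(n), 2):
        if prefers(items[i], items[j]):
            beats[i][j] = True
        else:
            beats[j][i] = True
    cycles = []
    # A triple is cyclic in exactly one of two orientations.
    for i, j, k in combinations(range(n), 3):
        if beats[i][j] and beats[j][k] and beats[k][i]:
            cycles.append((i, j, k))
        elif beats[i][k] and beats[k][j] and beats[j][i]:
            cycles.append((i, k, j))
    return cycles


# Toy judge with a built-in cycle (rock-paper-scissors).
winning_pairs = {("paper", "rock"), ("scissors", "paper"), ("rock", "scissors")}
hands = ["rock", "paper", "scissors"]
cyclic = find_preference_cycles(hands, lambda a, b: (a, b) in winning_pairs)

# A judge that simply prefers the larger number is transitive.
transitive = find_preference_cycles([3, 1, 2, 5], lambda a, b: a > b)
```

The fraction of cyclic triples in a group gives a rough indication of how far the judgments depart from a consistent ranking, which bears on how meaningful a win rate is for that group.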
minor comments (3)
- [Abstract] Abstract: Inconsistent capitalization of 'Pref-GRPO' versus 'PREF-GRPO' should be unified.
- [§2] §2 (Related Work): GRPO acronym is used before its first expansion; add the full name on first occurrence.
- [Figure 2] Figure 2: The pipeline diagram would be clearer with explicit annotation of the within-group pairwise comparison step and win-rate aggregation.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback. We address each major comment below, acknowledging where the original manuscript could be strengthened and outlining planned revisions.
-
Referee: [§3] §3 (Method): The claim that pairwise win rates from the preference RM yield stable advantages rests on the unverified assumption that the RM supplies unbiased signals for subtle differences; no inter-rater agreement, cycle detection, or adversarial robustness results for the RM are reported, which is load-bearing for the reward-hacking mitigation argument.
Authors: We agree that the reliability of the preference RM is central to the argument. The RM employed is a standard model pretrained on large-scale human preference datasets for T2I tasks, as cited in the manuscript. While the original submission did not report inter-rater agreement, cycle detection, or adversarial robustness tests specific to this RM, the pairwise formulation mitigates sensitivity to small absolute score variations by design. To address the concern, we will add a dedicated paragraph in §3 discussing the RM's established properties from prior literature and explicitly noting the absence of custom robustness experiments as a limitation.
Revision: yes
-
Referee: [§4.2] §4.2 (Experiments): The reported stability improvements lack controls or ablations isolating the effect of win-rate normalization versus other factors such as policy gradient variance or prompt distribution; without these, attribution of gains to the pairwise formulation is not fully secured.
Authors: The referee correctly identifies that stronger isolation of the win-rate component would improve attribution. Our experiments primarily compared full Pref-GRPO against GRPO baselines and showed improved stability, but did not include targeted ablations holding gradient variance and prompt distribution fixed while varying only the reward signal. We will incorporate additional ablation experiments in the revised §4.2 that systematically vary the reward formulation (pointwise vs. pairwise win-rate) under controlled conditions to better isolate its contribution.
Revision: yes
-
Referee: [Table 3] Table 3 or equivalent results table: No standard deviations, number of random seeds, or statistical significance tests are provided for the metric gains over GRPO baselines, weakening the cross-method stability claim.
Authors: We acknowledge that the lack of variability measures and statistical tests limits the strength of the stability claims. The original tables reported mean performance metrics without standard deviations, seed counts, or significance testing. In the revision we will update Table 3 and related result tables to include standard deviations computed over multiple random seeds (at least three) and add statistical significance indicators for key comparisons against the GRPO baseline.
Revision: yes
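The reporting promised in the last response (standard deviations over seeds plus a significance indicator) could look like the sketch below. The exact permutation test is one possible choice, not one the authors name, and the per-seed numbers are placeholders for illustration only, not results from the paper.

```python
from itertools import combinations
from statistics import mean, stdev
from typing import List


def exact_permutation_pvalue(a: List[float], b: List[float]) -> float:
    """Two-sided exact permutation test on the difference of group means."""
    pooled = a + b
    observed = abs(mean(a) - mean(b))
    extreme = 0
    total = 0
    # Enumerate every way of relabeling len(a) of the pooled runs as group A.
    for idx in combinations(range(len(pooled)), len(a)):
        chosen = set(idx)
        group_a = [pooled[i] for i in chosen]
        group_b = [pooled[i] for i in range(len(pooled)) if i not in chosen]
        # Small tolerance so the observed split counts despite float rounding.
        if abs(mean(group_a) - mean(group_b)) >= observed - 1e-12:
            extreme += 1
        total += 1
    return extreme / total


# Placeholder per-seed metrics, NOT results from the paper.
baseline_runs = [0.61, 0.63, 0.60]
method_runs = [0.66, 0.68, 0.67]

baseline_summary = (mean(baseline_runs), stdev(baseline_runs))  # (mean, sample std)
method_summary = (mean(method_runs), stdev(method_runs))
p_value = exact_permutation_pvalue(baseline_runs, method_runs)
```

With three seeds per arm there are only 20 possible relabelings, so the smallest two-sided p-value this test can return is 2/20 = 0.1. More seeds would be needed before a 0.05 threshold is reachable under this particular test.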
Circularity Check
No significant circularity: Pref-GRPO introduces an independent pairwise reformulation on top of cited GRPO and preference RM components
full rationale
The paper's central move replaces pointwise RM scoring and normalization with within-group pairwise win rates from a preference RM as the reward signal. This is presented as a direct methodological substitution rather than a derivation that reduces by construction to fitted parameters, self-definitions, or prior self-citations. No equations or load-bearing steps in the provided abstract or description equate the output advantage stability to the input normalization step or to any self-referential loop. The claim of mitigating reward hacking rests on the new objective having independent content, consistent with the reader's assessment of score 2.0 for minor non-load-bearing elements only.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Preference reward models can reliably distinguish subtle quality differences between images in a manner superior to pointwise scoring after normalization.
Forward citations
Cited by 11 Pith papers
-
Efficient Adjoint Matching for Fine-tuning Diffusion Models
EAM speeds up adjoint matching for diffusion model reward fine-tuning by switching to linear base drift, allowing deterministic few-step solvers and closed-form adjoints with up to 4x faster convergence on text-to-ima...
-
TMPO: Trajectory Matching Policy Optimization for Diverse and Efficient Diffusion Alignment
TMPO replaces scalar reward maximization with trajectory-level matching to a Boltzmann distribution via Softmax-TB, improving generative diversity by 9.1% while keeping competitive reward performance.
-
TMPO: Trajectory Matching Policy Optimization for Diverse and Efficient Diffusion Alignment
TMPO uses Softmax Trajectory Balance to match policy probabilities over multiple trajectories to a Boltzmann reward distribution, improving diversity by 9.1% in diffusion alignment tasks.
-
HP-Edit: A Human-Preference Post-Training Framework for Image Editing
HP-Edit introduces a post-training framework and RealPref-50K dataset that uses a VLM-based HP-Scorer to align diffusion image editing models with human preferences, improving outputs on Qwen-Image-Edit-2509.
-
Self-Improving Tabular Language Models via Iterative Group Alignment
TabGRAA enables self-improving tabular language models through iterative group-relative advantage alignment using modular automated quality signals like distinguishability classifiers.
-
LeapAlign: Post-Training Flow Matching Models at Any Generation Step by Building Two-Step Trajectories
LeapAlign fine-tunes flow matching models by constructing two consecutive leaps that skip multiple ODE steps with randomized timesteps and consistency weighting, enabling stable updates at any generation step.
-
Power Reinforcement Post-Training of Text-to-Image Models with Super-Linear Advantage Shaping
Super-Linear Advantage Shaping (SLAS) introduces a non-linear geometric policy update for RL post-training of text-to-image models that reshapes the local policy space via advantage-dependent Fisher-Rao weighting to r...
-
Leveraging Verifier-Based Reinforcement Learning in Image Editing
Edit-R1 trains a CoT-based reasoning reward model with GCPO and uses it to boost image editing performance over VLMs and models like FLUX.1-kontext via GRPO.
-
Reward-Aware Trajectory Shaping for Few-step Visual Generation
RATS lets few-step visual generators surpass multi-step teachers by shaping trajectories with reward-based adaptive guidance instead of strict imitation.
-
Reward Hacking in the Era of Large Models: Mechanisms, Emergent Misalignment, Challenges
The paper introduces the Proxy Compression Hypothesis as a unifying framework explaining reward hacking in RLHF as an emergent result of compressing high-dimensional human objectives into proxy reward signals under op...
-
PSR: Scaling Multi-Subject Personalized Image Generation with Pairwise Subject-Consistency Rewards
A data-generation pipeline plus pairwise subject-consistency rewards in RL improve consistency and prompt adherence for multi-subject personalized image generation.
Reference graph
Works this paper leans on
-
[1]
Training Diffusion Models with Reinforcement Learning
Kevin Black, Michael Janner, Yilun Du, Ilya Kostrikov, and Sergey Levine. Training diffusion models with reinforcement learning. arXiv preprint arXiv:2305.13301,
-
[2]
HiDream-I1: A High-Efficient Image Generative Foundation Model with Sparse Diffusion Transformer
Qi Cai, Jingwen Chen, Yang Chen, Yehao Li, Fuchen Long, Yingwei Pan, Zhaofan Qiu, Yiheng Zhang, Fengbin Gao, Peihan Xu, et al. Hidream-i1: A high-efficient image generative foundation model with sparse diffusion transformer. arXiv preprint arXiv:2505.22705,
-
[3]
BLIP3-o: A Family of Fully Open Unified Multimodal Models-Architecture, Training and Dataset
Jiuhai Chen, Zhiyang Xu, Xichen Pan, Yushi Hu, Can Qin, Tom Goldstein, Lifu Huang, Tianyi Zhou, Saining Xie, Silvio Savarese, et al. Blip3-o: A family of fully open unified multimodal models-architecture, training and dataset. arXiv preprint arXiv:2505.09568, 2025a. Xiaokang Chen, Zhiyu Wu, Xingchao Liu, Zizheng Pan, Wen Liu, Zhenda Xie, Xingkai Yu, and C...
-
[4]
Cogview: Mastering text-to-image generation via transformers
Ming Ding, Zhuoyi Yang, Wenyi Hong, Wendi Zheng, Chang Zhou, Da Yin, Junyang Lin, Xu Zou, Zhou Shao, Hongxia Yang, and Jie Tang. Cogview: Mastering text-to-image generation via transformers. arXiv preprint arXiv:2105.13290,
-
[5]
Yu Gao, Lixue Gong, Qiushan Guo, Xiaoxia Hou, Zhichao Lai, Fanshi Li, Liang Li, Xiaochen Lian, Chao Liao, Liyang Liu, et al. Seedream 3.0 technical report. arXiv preprint arXiv:2504.11346,
-
[6]
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948,
-
[7]
Tempflow-grpo: When timing matters for grpo in flow models
Xiaoxuan He, Siming Fu, Yuke Zhao, Wanli Li, Jian Yang, Dacheng Yin, Fengyun Rao, and Bo Zhang. Tempflow-grpo: When timing matters for grpo in flow models. arXiv preprint arXiv:2508.04324,
- [8]
-
[9]
Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card. arXiv preprint arXiv:2410.21276,
-
[10]
Playground v2.5: Three Insights towards Enhancing Aesthetic Quality in Text-to-Image Generation
Daiqing Li, Aleks Kamko, Ehsan Akhgari, Ali Sabet, Linmiao Xu, and Suhail Doshi. Playground v2.5: Three insights towards enhancing aesthetic quality in text-to-image generation. arXiv preprint arXiv:2402.17245, 2024a. Junzhe Li, Yutao Cui, Tao Huang, Yinping Ma, Chun Fan, Miles Yang, and Zhao Zhong. Mixgrpo: Unlocking flow-based grpo ef...
-
[11]
Flow Matching for Generative Modeling
Zhimin Li, Jianwei Zhang, Qin Lin, Jiangfeng Xiong, Yanxin Long, Xinchi Deng, Yingfang Zhang, Xingchao Liu, Minbin Huang, Zedong Xiao, Dayou Chen, Jiajun He, Jiahao Li, Wenyue Li, Chen Zhang, Rongwei Quan, Jianxiang Lu, Jiabin Huang, Xiaoyan Yuan, Xiaoxiao Zheng, Yixuan Li, Jihong Zhang, Chao Zhang, Meng Chen, Jie Liu, Zheng Fang, Weiyan Wang, Jinbao Xue,...
-
[12]
Flow-GRPO: Training Flow Matching Models via Online RL
Jie Liu, Gongye Liu, Jiajun Liang, Yangguang Li, Jiaheng Liu, Xintao Wang, Pengfei Wan, Di Zhang, and Wanli Ouyang. Flow-grpo: Training flow matching models via online rl. arXiv preprint arXiv:2505.05470,
-
[13]
Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow
Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow. arXiv preprint arXiv:2209.03003,
-
[14]
WISE: A World Knowledge-Informed Semantic Evaluation for Text-to-Image Generation
Yuwei Niu, Munan Ning, Mengren Zheng, Weiyang Jin, Bin Lin, Peng Jin, Jiaqi Liao, Chaoran Feng, Kunpeng Ning, Bin Zhu, et al. Wise: A world knowledge-informed semantic evaluation for text-to-image generation. arXiv preprint arXiv:2503.07265,
-
[15]
Delving into rl for image generation with cot: A study on dpo vs. grpo
Chengzhuo Tong, Ziyu Guo, Renrui Zhang, Wenyu Shan, Xinyu Wei, Zhenghao Xing, Hongsheng Li, and Pheng-Ann Heng. Delving into rl for image generation with cot: A study on dpo vs. grpo. arXiv preprint arXiv:2505.17017,
-
[16]
Emu3: Next-Token Prediction is All You Need
Xinlong Wang, Xiaosong Zhang, Zhengxiong Luo, Quan Sun, Yufeng Cui, Jinsheng Wang, Fan Zhang, Yueze Wang, Zhen Li, Qiying Yu, et al. Emu3: Next-token prediction is all you need. arXiv preprint arXiv:2409.18869,
-
[17]
Unified multimodal chain-of-thought reward model through reinforcement fine-tuning
Yibin Wang, Zhimin Li, Yuhang Zang, Chunyu Wang, Qinglin Lu, Cheng Jin, and Jiaqi Wang. Unified multimodal chain-of-thought reward model through reinforcement fine-tuning. arXiv preprint arXiv:2505.03318, 2025a. Yibin Wang, Yuhang Zang, Hao Li, Cheng Jin, and Jiaqi Wang. Unified reward model for multimodal understanding and generation. arXiv preprint arXi...
-
[18]
Chenfei Wu, Jiahao Li, Jingren Zhou, Junyang Lin, Kaiyuan Gao, Kun Yan, Sheng-ming Yin, Shuai Bai, Xiao Xu, Yilei Chen, et al. Qwen-image technical report. arXiv preprint arXiv:2508.02324, 2025a. Chengyue Wu, Xiaokang Chen, Zhiyu Wu, Yiyang Ma, Xingchao Liu, Zizheng Pan, Wen Liu, Zhenda Xie, Xingkai Yu, Chong Ruan, et al. Janus: Decoupling visual encoding...
-
[19]
Show-o2: Improved Native Unified Multimodal Models
Jinheng Xie, Zhenheng Yang, and Mike Zheng Shou. Show-o2: Improved native unified multimodal models. arXiv preprint arXiv:2506.15564,
-
[20]
DanceGRPO: Unleashing GRPO on Visual Generation
Zeyue Xue, Jie Wu, Yu Gao, Fangyuan Kong, Lingting Zhu, Mengzhao Chen, Zhiheng Liu, Wei Liu, Qiushan Guo, Weilin Huang, et al. Dancegrpo: Unleashing grpo on visual generation. arXiv preprint arXiv:2505.07818,
-
[21]
Specifically, we observe that HPS tends to favor images with saturated colors
A Pref-GRPO. A.1 Why pairwise preference-based reward works. This work finds that reward hacking is fundamentally caused by the model overly aligning with the reward model’s preferences. Specifically, we observe that HPS tends to favor images with saturated colors. However, when the model excessively optimizes this preference, reward h...
-
[22]
also discuss the issue of reward hacking, recognizing it as a pervasive challenge in the field. However, these methods typically attempt to alleviate the problem by adjusting experimental settings, such as incorporating the KL loss (Liu et al., 2025). In contrast, our work reveals that the underlying cause of reward hacking is the issue of illusory advant...
-
[23]
as the pointwise reward model. As shown in Fig. 8, the reward score increases sharply during training, yet the model quality begins to deteriorate around step 160, manifesting as over-saturated colors. Despite this degradation, the reward score continues to rise. This indicates the presence of the illusory advantage problem, where the model excessively opt...
-
[24]
B.7 Superiority of UniGenBench. The superiority of UniGenBench can be summarized as follows: • Comprehensive Dimension Evaluation: It spans 10 primary dimensions and 27 sub-dimensions, offering a systematic and in-depth assessment of a model’s capabilities across various aspects. To the best of our knowledge, this is the most comprehensive benchmark in ...