pith. the verified trust layer for science.

arxiv: 2508.20751 · v2 · submitted 2025-08-28 · 💻 cs.CV

Pref-GRPO: Pairwise Preference Reward-based GRPO for Stable Text-to-Image Reinforcement Learning

Pith reviewed 2026-05-18 20:27 UTC · model grok-4.3

classification 💻 cs.CV
keywords Pref-GRPO · text-to-image generation · reinforcement learning · reward hacking · pairwise preference · GRPO · UniGenBench

The pith

By using win rates from pairwise comparisons, Pref-GRPO stabilizes reinforcement learning for text-to-image generation and reduces reward hacking.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The authors aim to solve the problem of reward hacking and training instability in GRPO methods for text-to-image generation. They show that pointwise reward models lead to illusory advantages when small score differences are normalized and amplified. Pref-GRPO addresses this by performing pairwise comparisons of images within each group and using the win rate as the reward signal instead. This changes the goal to fitting preferences rather than maximizing scores. They support this with experiments and introduce UniGenBench, a detailed benchmark using MLLM for evaluating semantic consistency in generated images.
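
The amplification effect described above can be illustrated with a minimal sketch. It assumes the usual GRPO group normalization, (r - mean) / std, and uses the population standard deviation; the four scores are made-up placeholders, not values from the paper.

```python
import statistics

def grpo_advantages(rewards):
    """Group-normalized advantages: (r - mean) / std over one group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std is an assumption here
    if std == 0:
        return [0.0 for _ in rewards]
    return [(r - mean) / std for r in rewards]

# Hypothetical pointwise scores for a group of 4 images; gaps are 0.005 each.
scores = [0.860, 0.865, 0.870, 0.875]
advantages = grpo_advantages(scores)

score_spread = max(scores) - min(scores)
advantage_spread = max(advantages) - min(advantages)
print(f"score spread: {score_spread:.3f}")
print(f"advantage spread: {advantage_spread:.2f}")
```

In this toy group the raw scores span only 0.015, yet the normalized advantages span roughly 2.7 units, which is the kind of "illusory advantage" the summary refers to.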

Core claim

The central claim is that shifting the optimization from score maximization to preference fitting via win rates obtained from pairwise comparisons within groups using preference reward models yields more stable advantages and mitigates reward hacking in text-to-image reinforcement learning.

What carries the argument

Pairwise preference reward-based GRPO, in which the win rate from intra-group pairwise comparisons serves as the reward signal to replace normalized pointwise scores.
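
A minimal sketch of the win-rate reward, under stated assumptions: `prefers` is a hypothetical stand-in for a pairwise preference reward model, each unordered pair is compared once, and the win rate is the fraction of the other group members an image beats. The paper's exact querying scheme may differ.

```python
from itertools import combinations
from typing import Callable, Sequence, TypeVar

Image = TypeVar("Image")

def win_rate_rewards(
    images: Sequence[Image],
    prefers: Callable[[Image, Image], bool],
) -> list[float]:
    """Return, for each image, the fraction of other group members it beats.

    `prefers(a, b)` is a hypothetical preference-RM call that returns True
    when `a` is preferred over `b`.
    """
    n = len(images)
    if n < 2:
        raise ValueError("need at least two images per group")
    wins = [0] * n
    for i, j in combinations(range(n), 2):
        if prefers(images[i], images[j]):
            wins[i] += 1
        else:
            wins[j] += 1
    return [w / (n - 1) for w in wins]

# Stand-in comparator: compares placeholder quality numbers directly.
group = [0.860, 0.865, 0.870, 0.875]
rewards = win_rate_rewards(group, lambda a, b: a > b)
print(rewards)
```

With this comparator the reward depends only on the ordering inside the group, not on how large the underlying score gaps are. Whether a learned preference model orders images more reliably than a pointwise scorer is the empirical question the paper addresses.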

If this is right

  • Training becomes more stable by avoiding the amplification of minimal score differences.
  • The method better captures subtle differences in image quality.
  • Reward hacking is reduced in GRPO-based approaches for T2I.
  • UniGenBench enables more comprehensive evaluation of T2I models across multiple themes and criteria.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar pairwise preference approaches could improve stability in other areas of reinforcement learning from feedback.
  • The new benchmark may highlight specific weaknesses in current T2I models that pointwise evaluations miss.
  • Integrating preference fitting with other optimization techniques could lead to even better image generation results.

Load-bearing premise

That the preference reward models give accurate signals for subtle quality differences and that the main cause of prior instability is the normalization amplifying tiny score gaps.

What would settle it

A demonstration that Pref-GRPO exhibits reward hacking or instability when applied to certain preference models, or that the win rates fail to predict human preferences on image pairs.

Figures

Figures reproduced from arXiv: 2508.20751 by Cheng Jin, Chunyu Wang, Jiaqi Wang, Jiazi Bu, Qinglin Lu, Yibin Wang, Yuhang Zang, Yujie Zhou, Zhimin Li.

Figure 1. Method Overview. (a) Existing pointwise reward functions assign minimal score differences between generated images, which result in illusory advantage and ultimately lead to reward hacking. (b) PREF-GRPO shifts the training focus from reward score maximization to pairwise preference fitting, enabling stable optimization for T2I generation.
Figure 2. Reward Hacking Visualization.
Figure 3. Benchmark Statistics and Evaluation Results. This figure presents (a) prompt themes, (b) subject distribution, and evaluation dimensions (testpoints) of UNIGENBENCH, along with benchmarking results for both open-source and closed-source T2I models.
Figure 4. Benchmark Comparison. While current methods only support scoring at the primary dimensions, our benchmark provides fine-grained evaluation across both primary and sub dimensions.
Figure 5. UNIGENBENCH Construction and Evaluation Pipeline. We leverage powerful MLLM for (a) large-scale and diverse prompts generation, and (b) scalable and fine-grained T2I evaluation.
Figure 6. Qualitative Comparison. We compare PREF-GRPO with several pointwise RM-based GRPO methods, demonstrating its superior performance and effectiveness.
Figure 7. Qualitative Results of UnifiedReward Score-based Winrate as Reward. We convert UnifiedReward scores into win rates as rewards for GRPO, and observe that this effectively mitigates the reward hacking issue.
Figure 8. Reward Hacking Visualization of HPS. At around step 160, the image quality begins to degrade, even though the reward score continues to rise, indicating the occurrence of reward hacking.
Figure 9. Qualitative Results of Joint Optimization. Joint training with CLIP improves semantic consistency while slightly degrading perceptual quality, reflecting the inherent trade-off.
Figure 10. More Qualitative Comparison. We compare PREF-GRPO with several pointwise RM-based GRPO methods, demonstrating its superior performance and effectiveness.
Figure 11. Fine-grained Benchmarking Results of T2I models on UNIGENBENCH. Best scores are in green, second-best in yellow.
Figure 12. Prompt Themes of UNIGENBENCH. We provide representative prompt examples for each theme to facilitate understanding.
Figure 13. Evaluation Dimensions of UNIGENBENCH. We provide representative prompt examples for each evaluation dimension to facilitate understanding.
Figure 14. Distribution of Testpoint Counts in Prompts. This figure presents the distribution of the number of testpoints per prompt in UNIGENBENCH. Plot text recovered from the extraction: axes "Number of Testpoints" and "Number of Prompts"; ticks 1 to 5; bar values 64, 155, 159, 147, 75.
Original abstract

Recent advancements highlight the importance of GRPO-based reinforcement learning methods and benchmarking in enhancing text-to-image (T2I) generation. However, current methods using pointwise reward models (RM) for scoring generated images are susceptible to reward hacking. We reveal that this happens when minimal score differences between images are amplified after normalization, creating illusory advantages that drive the model to over-optimize for trivial gains, ultimately destabilizing the image generation process. To address this, we propose Pref-GRPO, a pairwise preference reward-based GRPO method that shifts the optimization objective from score maximization to preference fitting, ensuring more stable training. In Pref-GRPO, images are pairwise compared within each group using preference RM, and the win rate is used as the reward signal. Extensive experiments demonstrate that PREF-GRPO differentiates subtle image quality differences, providing more stable advantages and mitigating reward hacking. Additionally, existing T2I benchmarks are limited by coarse evaluation criteria, hindering comprehensive model assessment. To solve this, we introduce UniGenBench, a unified T2I benchmark comprising 600 prompts across 5 main themes and 20 subthemes. It evaluates semantic consistency through 10 primary and 27 sub-criteria, leveraging MLLM for benchmark construction and evaluation. Our benchmarks uncover the strengths and weaknesses of both open and closed-source T2I models and validate the effectiveness of Pref-GRPO.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 3 minor

Summary. The paper claims that pointwise reward model scores in GRPO for text-to-image generation lead to reward hacking because normalization amplifies minimal differences into unstable advantages. Pref-GRPO addresses this by using a preference reward model to perform pairwise comparisons within groups of images and deriving win rates as the reward signal, shifting the objective to preference fitting. The work also introduces UniGenBench, a benchmark with 600 prompts across 5 themes and 20 subthemes, using MLLM-based evaluation on 10 primary and 27 sub-criteria for semantic consistency. Experiments are said to show more stable training and better differentiation of subtle quality differences.

Significance. If the central empirical claims hold, Pref-GRPO would represent a targeted improvement to GRPO-style RL for T2I models by replacing a known source of advantage instability with pairwise preference signals. UniGenBench could fill a gap in coarse existing T2I evaluation by providing finer-grained, multi-criteria assessment. The approach builds directly on cited GRPO and preference RM components without introducing new free parameters in the core formulation.

major comments (3)
  1. [§3] §3 (Method): The claim that pairwise win rates from the preference RM yield stable advantages rests on the unverified assumption that the RM supplies unbiased signals for subtle differences; no inter-rater agreement, cycle detection, or adversarial robustness results for the RM are reported, which is load-bearing for the reward-hacking mitigation argument.
  2. [§4.2] §4.2 (Experiments): The reported stability improvements lack controls or ablations isolating the effect of win-rate normalization versus other factors such as policy gradient variance or prompt distribution; without these, attribution of gains to the pairwise formulation is not fully secured.
  3. [Table 3] Table 3 or equivalent results table: No standard deviations, number of random seeds, or statistical significance tests are provided for the metric gains over GRPO baselines, weakening the cross-method stability claim.
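
The cycle-detection check named in major comment 1 could look like the sketch below. It assumes the preference RM's decisions for one group have already been collected into a boolean matrix `beats`, a hypothetical name, where `beats[i][j]` is True when image i was preferred over image j. It reports every intransitive triple.

```python
from itertools import combinations

def intransitive_triples(beats):
    """Find preference cycles of length three in a pairwise decision matrix.

    Each returned tuple (a, b, c) means a beat b, b beat c, and c beat a.
    """
    n = len(beats)
    cycles = []
    for i, j, k in combinations(range(n), 3):
        if beats[i][j] and beats[j][k] and beats[k][i]:
            cycles.append((i, j, k))
        elif beats[i][k] and beats[k][j] and beats[j][i]:
            cycles.append((i, k, j))
    return cycles

# Toy matrices, not model outputs.
cyclic = [
    [False, True, False],   # 0 beats 1
    [False, False, True],   # 1 beats 2
    [True, False, False],   # 2 beats 0
]
transitive = [
    [False, True, True],    # 0 beats 1 and 2
    [False, False, True],   # 1 beats 2
    [False, False, False],
]
print(intransitive_triples(cyclic))
print(intransitive_triples(transitive))
```

A high rate of such triples within groups would suggest that win rates rest on inconsistent comparisons, which is the concern the referee raises.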
minor comments (3)
  1. [Abstract] Abstract: Inconsistent capitalization of 'Pref-GRPO' versus 'PREF-GRPO' should be unified.
  2. [§2] §2 (Related Work): GRPO acronym is used before its first expansion; add the full name on first occurrence.
  3. [Figure 1] Figure 1: The method overview diagram would be clearer with explicit annotation of the within-group pairwise comparison step and win-rate aggregation.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. We address each major comment below, acknowledging where the original manuscript could be strengthened and outlining planned revisions.

  1. Referee: [§3] §3 (Method): The claim that pairwise win rates from the preference RM yield stable advantages rests on the unverified assumption that the RM supplies unbiased signals for subtle differences; no inter-rater agreement, cycle detection, or adversarial robustness results for the RM are reported, which is load-bearing for the reward-hacking mitigation argument.

    Authors: We agree that the reliability of the preference RM is central to the argument. The RM employed is a standard model pretrained on large-scale human preference datasets for T2I tasks, as cited in the manuscript. While the original submission did not report inter-rater agreement, cycle detection, or adversarial robustness tests specific to this RM, the pairwise formulation mitigates sensitivity to small absolute score variations by design. To address the concern, we will add a dedicated paragraph in §3 discussing the RM's established properties from prior literature and explicitly noting the absence of custom robustness experiments as a limitation. revision: yes

  2. Referee: [§4.2] §4.2 (Experiments): The reported stability improvements lack controls or ablations isolating the effect of win-rate normalization versus other factors such as policy gradient variance or prompt distribution; without these, attribution of gains to the pairwise formulation is not fully secured.

    Authors: The referee correctly identifies that stronger isolation of the win-rate component would improve attribution. Our experiments primarily compared full Pref-GRPO against GRPO baselines and showed improved stability, but did not include targeted ablations holding gradient variance and prompt distribution fixed while varying only the reward signal. We will incorporate additional ablation experiments in the revised §4.2 that systematically vary the reward formulation (pointwise vs. pairwise win-rate) under controlled conditions to better isolate its contribution. revision: yes

  3. Referee: [Table 3] Table 3 or equivalent results table: No standard deviations, number of random seeds, or statistical significance tests are provided for the metric gains over GRPO baselines, weakening the cross-method stability claim.

    Authors: We acknowledge that the lack of variability measures and statistical tests limits the strength of the stability claims. The original tables reported mean performance metrics without standard deviations, seed counts, or significance testing. In the revision we will update Table 3 and related result tables to include standard deviations computed over multiple random seeds (at least three) and add statistical significance indicators for key comparisons against the GRPO baseline. revision: yes
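
The reporting promised in the third response can be sketched as follows. This is an illustration only: the per-seed numbers are placeholders and do not come from the paper, and Welch's t statistic is one possible choice of significance test, not one the authors name.

```python
import math
import statistics

def summarize(scores):
    """Mean and sample standard deviation across random seeds."""
    return statistics.fmean(scores), statistics.stdev(scores)

def welch_t(a, b):
    """Welch's t statistic for two independent samples with unequal variance."""
    mean_a, std_a = summarize(a)
    mean_b, std_b = summarize(b)
    se = math.sqrt(std_a ** 2 / len(a) + std_b ** 2 / len(b))
    if se == 0:
        raise ValueError("both samples have zero variance")
    return (mean_a - mean_b) / se

# Placeholder per-seed metric values (three seeds each), NOT reported results.
method_runs = [0.70, 0.72, 0.71]
baseline_runs = [0.66, 0.69, 0.64]

mean, std = summarize(method_runs)
print(f"method: {mean:.3f} ± {std:.3f}")
print(f"Welch t vs. baseline: {welch_t(method_runs, baseline_runs):.2f}")
```

With only three seeds per method the statistic is coarse, so a results table would still need to state the seed count alongside it.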

Circularity Check

0 steps flagged

No significant circularity: Pref-GRPO introduces independent pairwise reformulation on top of cited GRPO and preference RM components


The paper's central move replaces pointwise RM scoring and normalization with within-group pairwise win rates from a preference RM as the reward signal. This is presented as a direct methodological substitution rather than a derivation that reduces by construction to fitted parameters, self-definitions, or prior self-citations. No equations or load-bearing steps in the provided abstract or description equate the output advantage stability to the input normalization step or to any self-referential loop. The claim of mitigating reward hacking rests on the new objective having independent content, consistent with the reader's assessment of score 2.0 for minor non-load-bearing elements only.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Relies on standard GRPO optimization and existing preference reward models from prior work; no new free parameters or invented entities are described in the abstract.

axioms (1)
  • domain assumption: Preference reward models can reliably distinguish subtle quality differences between images in a manner superior to pointwise scoring after normalization.
    Invoked to justify replacing pointwise scores with win rates as the reward signal.

pith-pipeline@v0.9.0 · 5813 in / 1346 out tokens · 63939 ms · 2026-05-18T20:27:28.653433+00:00 · methodology

Forward citations

Cited by 11 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Efficient Adjoint Matching for Fine-tuning Diffusion Models

    cs.LG 2026-05 unverdicted novelty 7.0

    EAM speeds up adjoint matching for diffusion model reward fine-tuning by switching to linear base drift, allowing deterministic few-step solvers and closed-form adjoints with up to 4x faster convergence on text-to-ima...

  2. TMPO: Trajectory Matching Policy Optimization for Diverse and Efficient Diffusion Alignment

    cs.LG 2026-05 unverdicted novelty 7.0

    TMPO replaces scalar reward maximization with trajectory-level matching to a Boltzmann distribution via Softmax-TB, improving generative diversity by 9.1% while keeping competitive reward performance.

  3. TMPO: Trajectory Matching Policy Optimization for Diverse and Efficient Diffusion Alignment

    cs.LG 2026-05 unverdicted novelty 7.0

    TMPO uses Softmax Trajectory Balance to match policy probabilities over multiple trajectories to a Boltzmann reward distribution, improving diversity by 9.1% in diffusion alignment tasks.

  4. HP-Edit: A Human-Preference Post-Training Framework for Image Editing

    cs.CV 2026-04 unverdicted novelty 7.0

    HP-Edit introduces a post-training framework and RealPref-50K dataset that uses a VLM-based HP-Scorer to align diffusion image editing models with human preferences, improving outputs on Qwen-Image-Edit-2509.

  5. Self-Improving Tabular Language Models via Iterative Group Alignment

    cs.LG 2026-04 unverdicted novelty 7.0

    TabGRAA enables self-improving tabular language models through iterative group-relative advantage alignment using modular automated quality signals like distinguishability classifiers.

  6. LeapAlign: Post-Training Flow Matching Models at Any Generation Step by Building Two-Step Trajectories

    cs.CV 2026-04 unverdicted novelty 7.0

    LeapAlign fine-tunes flow matching models by constructing two consecutive leaps that skip multiple ODE steps with randomized timesteps and consistency weighting, enabling stable updates at any generation step.

  7. Power Reinforcement Post-Training of Text-to-Image Models with Super-Linear Advantage Shaping

    cs.CV 2026-05 unverdicted novelty 6.0

    Super-Linear Advantage Shaping (SLAS) introduces a non-linear geometric policy update for RL post-training of text-to-image models that reshapes the local policy space via advantage-dependent Fisher-Rao weighting to r...

  8. Leveraging Verifier-Based Reinforcement Learning in Image Editing

    cs.CV 2026-04 unverdicted novelty 6.0

    Edit-R1 trains a CoT-based reasoning reward model with GCPO and uses it to boost image editing performance over VLMs and models like FLUX.1-kontext via GRPO.

  9. Reward-Aware Trajectory Shaping for Few-step Visual Generation

    cs.CV 2026-04 unverdicted novelty 5.0

    RATS lets few-step visual generators surpass multi-step teachers by shaping trajectories with reward-based adaptive guidance instead of strict imitation.

  10. Reward Hacking in the Era of Large Models: Mechanisms, Emergent Misalignment, Challenges

    cs.LG 2026-04 unverdicted novelty 5.0

    The paper introduces the Proxy Compression Hypothesis as a unifying framework explaining reward hacking in RLHF as an emergent result of compressing high-dimensional human objectives into proxy reward signals under op...

  11. PSR: Scaling Multi-Subject Personalized Image Generation with Pairwise Subject-Consistency Rewards

    cs.CV 2025-12 conditional novelty 5.0

    A data-generation pipeline plus pairwise subject-consistency rewards in RL improve consistency and prompt adherence for multi-subject personalized image generation.

Reference graph

Works this paper leans on

24 extracted references · 24 canonical work pages · cited by 10 Pith papers · 15 internal anchors

  1. [1]

    Training Diffusion Models with Reinforcement Learning

    Kevin Black, Michael Janner, Yilun Du, Ilya Kostrikov, and Sergey Levine. Training diffusion models with reinforcement learning. arXiv preprint arXiv:2305.13301,

  2. [2]

    HiDream-I1: A High-Efficient Image Generative Foundation Model with Sparse Diffusion Transformer

    Qi Cai, Jingwen Chen, Yang Chen, Yehao Li, Fuchen Long, Yingwei Pan, Zhaofan Qiu, Yiheng Zhang, Fengbin Gao, Peihan Xu, et al. Hidream-i1: A high-efficient image generative foundation model with sparse diffusion transformer. arXiv preprint arXiv:2505.22705,

  3. [3]

    BLIP3-o: A Family of Fully Open Unified Multimodal Models-Architecture, Training and Dataset

    Jiuhai Chen, Zhiyang Xu, Xichen Pan, Yushi Hu, Can Qin, Tom Goldstein, Lifu Huang, Tianyi Zhou, Saining Xie, Silvio Savarese, et al. Blip3-o: A family of fully open unified multimodal models-architecture, training and dataset. arXiv preprint arXiv:2505.09568, 2025a. Xiaokang Chen, Zhiyu Wu, Xingchao Liu, Zizheng Pan, Wen Liu, Zhenda Xie, Xingkai Yu, and C...

  4. [4]

    Cogview: Mastering text-to-image generation via transformers

    Ming Ding, Zhuoyi Yang, Wenyi Hong, Wendi Zheng, Chang Zhou, Da Yin, Junyang Lin, Xu Zou, Zhou Shao, Hongxia Yang, and Jie Tang. Cogview: Mastering text-to-image generation via transformers. arXiv preprint arXiv:2105.13290,

  5. [5]

    Seedream 3.0 Technical Report

    Yu Gao, Lixue Gong, Qiushan Guo, Xiaoxia Hou, Zhichao Lai, Fanshi Li, Liang Li, Xiaochen Lian, Chao Liao, Liyang Liu, et al. Seedream 3.0 technical report. arXiv preprint arXiv:2504.11346,

  6. [6]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948,

  7. [7]

    Tempflow-grpo: When timing matters for grpo in flow models

    Xiaoxuan He, Siming Fu, Yuke Zhao, Wanli Li, Jian Yang, Dacheng Yin, Fengyun Rao, and Bo Zhang. Tempflow-grpo: When timing matters for grpo in flow models. arXiv preprint arXiv:2508.04324,

  8. [8]

    arXiv preprint arXiv:2507.15855,

  9. [9]

    GPT-4o System Card

    Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card. arXiv preprint arXiv:2410.21276,

  10. [10]

    Playground v2.5: Three Insights towards Enhancing Aesthetic Quality in Text-to-Image Generation

    Daiqing Li, Aleks Kamko, Ehsan Akhgari, Ali Sabet, Linmiao Xu, and Suhail Doshi. Playground v2.5: Three insights towards enhancing aesthetic quality in text-to-image generation. arXiv preprint arXiv:2402.17245, 2024a. Junzhe Li, Yutao Cui, Tao Huang, Yinping Ma, Chun Fan, Miles Yang, and Zhao Zhong. Mixgrpo: Unlocking flow-based grpo ef...

  11. [11]

    Flow Matching for Generative Modeling

    Zhimin Li, Jianwei Zhang, Qin Lin, Jiangfeng Xiong, Yanxin Long, Xinchi Deng, Yingfang Zhang, Xingchao Liu, Minbin Huang, Zedong Xiao, Dayou Chen, Jiajun He, Jiahao Li, Wenyue Li, Chen Zhang, Rongwei Quan, Jianxiang Lu, Jiabin Huang, Xiaoyan Yuan, Xiaoxiao Zheng, Yixuan Li, Jihong Zhang, Chao Zhang, Meng Chen, Jie Liu, Zheng Fang, Weiyan Wang, Jinbao Xue,...

  12. [12]

    Flow-GRPO: Training Flow Matching Models via Online RL

    Jie Liu, Gongye Liu, Jiajun Liang, Yangguang Li, Jiaheng Liu, Xintao Wang, Pengfei Wan, Di Zhang, and Wanli Ouyang. Flow-grpo: Training flow matching models via online rl. arXiv preprint arXiv:2505.05470,

  13. [13]

    Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow

    Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow. arXiv preprint arXiv:2209.03003,

  14. [14]

    WISE: A World Knowledge-Informed Semantic Evaluation for Text-to-Image Generation

    Yuwei Niu, Munan Ning, Mengren Zheng, Weiyang Jin, Bin Lin, Peng Jin, Jiaqi Liao, Chaoran Feng, Kunpeng Ning, Bin Zhu, et al. Wise: A world knowledge-informed semantic evaluation for text-to-image generation. arXiv preprint arXiv:2503.07265,

  15. [15]

    Delving into rl for image generation with cot: A study on dpo vs. grpo

    Chengzhuo Tong, Ziyu Guo, Renrui Zhang, Wenyu Shan, Xinyu Wei, Zhenghao Xing, Hongsheng Li, and Pheng-Ann Heng. Delving into rl for image generation with cot: A study on dpo vs. grpo. arXiv preprint arXiv:2505.17017,

  16. [16]

    Emu3: Next-Token Prediction is All You Need

    Xinlong Wang, Xiaosong Zhang, Zhengxiong Luo, Quan Sun, Yufeng Cui, Jinsheng Wang, Fan Zhang, Yueze Wang, Zhen Li, Qiying Yu, et al. Emu3: Next-token prediction is all you need. arXiv preprint arXiv:2409.18869,

  17. [17]

    Unified multimodal chain-of-thought reward model through reinforcement fine-tuning

    Yibin Wang, Zhimin Li, Yuhang Zang, Chunyu Wang, Qinglin Lu, Cheng Jin, and Jiaqi Wang. Unified multimodal chain-of-thought reward model through reinforcement fine-tuning. arXiv preprint arXiv:2505.03318, 2025a. Yibin Wang, Yuhang Zang, Hao Li, Cheng Jin, and Jiaqi Wang. Unified reward model for multimodal understanding and generation. arXiv preprint arXi...

  18. [18]

    Qwen-Image Technical Report

    Chenfei Wu, Jiahao Li, Jingren Zhou, Junyang Lin, Kaiyuan Gao, Kun Yan, Sheng-ming Yin, Shuai Bai, Xiao Xu, Yilei Chen, et al. Qwen-image technical report. arXiv preprint arXiv:2508.02324, 2025a. Chengyue Wu, Xiaokang Chen, Zhiyu Wu, Yiyang Ma, Xingchao Liu, Zizheng Pan, Wen Liu, Zhenda Xie, Xingkai Yu, Chong Ruan, et al. Janus: Decoupling visual encoding...

  19. [19]

    Show-o2: Improved Native Unified Multimodal Models

    Jinheng Xie, Zhenheng Yang, and Mike Zheng Shou. Show-o2: Improved native unified multimodal models. arXiv preprint arXiv:2506.15564,

  20. [20]

    DanceGRPO: Unleashing GRPO on Visual Generation

    Zeyue Xue, Jie Wu, Yu Gao, Fangyuan Kong, Lingting Zhu, Mengzhao Chen, Zhiheng Liu, Wei Liu, Qiushan Guo, Weilin Huang, et al. Dancegrpo: Unleashing grpo on visual generation. arXiv preprint arXiv:2505.07818,

  21. [21]

    Specifically, we observe that HPS tends to favor images with saturated colors

    A PREF-GRPO. A.1 WHY PAIRWISE PREFERENCE-BASED REWARD WORKS. This work finds that reward hacking is fundamentally caused by the model overly aligning with the reward model’s preferences. Specifically, we observe that HPS tends to favor images with saturated colors. However, when the model excessively optimizes this preference, reward h...

  22. [22]

    However, these methods typically attempt to alleviate the problem by adjusting experimental settings, such as incorporating the KL loss (Liu et al., 2025)

    also discuss the issue of reward hacking, recognizing it as a pervasive challenge in the field. However, these methods typically attempt to alleviate the problem by adjusting experimental settings, such as incorporating the KL loss (Liu et al., 2025). In contrast, our work reveals that the underlying cause of reward hacking is the issue of illusory advant...

  23. [23]

    As shown in Fig

    as the pointwise reward model. As shown in Fig. 8, the reward score increases sharply during training, yet the model quality begins to deteriorate around step 160, manifesting as over-saturated colors. Despite this degradation, the reward score continues to rise. This indicates the presence of the illusory advantage problem, where the model excessively opt...

  24. [24]

    To the best of our knowledge, this is the most comprehensive benchmark in terms of evaluation dimensions

    B.7 SUPERIORITY OF UNIGENBENCH. The superiority of UNIGENBENCH can be summarized as follows: • Comprehensive Dimension Evaluation: It spans 10 primary dimensions and 27 sub-dimensions, offering a systematic and in-depth assessment of a model’s capabilities across various aspects. To the best of our knowledge, this is the most comprehensive benchmark in ...