Balancing Performance and Diversity in GRPO Autoregressive Text-to-Image Post-Training

Chunru Yang; Hongbo Duan; Jiahua Pei; Xueqian Wang; Yi Liu; Yuanhao Chiang

arxiv: 2606.21498 · v1 · pith:I4HYJILAnew · submitted 2026-06-19 · 💻 cs.AI · cs.LG

Balancing Performance and Diversity in GRPO Autoregressive Text-to-Image Post-Training

Yuanhao Chiang , Hongbo Duan , Chunru Yang , Jiahua Pei , Yi Liu , Xueqian Wang This is my paper

Pith reviewed 2026-06-26 14:13 UTC · model grok-4.3

classification 💻 cs.AI cs.LG

keywords GRPOautoregressive text-to-imagef-divergenceJS divergencepolicy optimizationreinforcement learning alignmentgeneration diversity

0 comments

The pith

JS divergence achieves the best trade-off between optimization performance and generation diversity in GRPO-style autoregressive text-to-image alignment.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper studies how reference-policy divergence choices affect token-level updates inside GRPO optimization for autoregressive text-to-image models. It places forward KL, reverse KL, and JS divergence inside a single f-divergence framework and shows that each divergence produces a distinct reshaping of the sampled-token updates. Under the shaping form actually used in training, JS regularization reduces the uniform bias that appears with other divergences while still limiting large policy shifts. Experiments on LlamaGen and Janus-7B confirm that JS yields the strongest or near-strongest scores on standard metrics and simultaneously preserves favorable output diversity.

Core claim

Under the sampled-token shaping form used, JS regularization achieves a favorable trade-off by mitigating uniform bias relative to the reference policy while still discouraging large deviations. Extensive experiments on LlamaGen and Janus-7B show that JS divergence achieves the strongest or highly competitive optimization performance on most evaluation metrics while maintaining favorable generation diversity.

What carries the argument

The unified f-divergence framework (forward KL, reverse KL, JS) applied to token-level updates inside the sampled-token shaping form of GRPO.

If this is right

JS divergence yields strongest or near-strongest scores on standard T2I alignment metrics.
JS divergence preserves higher generation diversity than the alternatives under the same shaping form.
Different f-divergences produce measurably distinct token-level update distributions during GRPO.
The choice of divergence is a controllable lever that directly trades off bias toward uniformity against policy deviation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same divergence analysis could be applied to other autoregressive generation tasks that already use GRPO-style updates.
If the shaping form changes, the relative advantage of JS may shift, suggesting a need to re-derive the token-level bias for each new shaping rule.
The framework supplies a practical recipe for selecting the divergence term when both performance and diversity must be controlled simultaneously.

Load-bearing premise

The unified f-divergence framework and sampled-token shaping form correctly describe how token-level updates actually occur during GRPO training of autoregressive T2I models.

What would settle it

A controlled run on the same models and data where JS divergence fails to match or exceed the other divergences on the majority of metrics or shows clearly lower diversity would falsify the central claim.

Figures

Figures reproduced from arXiv: 2606.21498 by Chunru Yang, Hongbo Duan, Jiahua Pei, Xueqian Wang, Yi Liu, Yuanhao Chiang.

**Figure 2.** Figure 2: Qualitative comparison of LlamaGen variants trained with RKL divergence, FKL divergence, [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗

**Figure 3.** Figure 3: Training reward dynamics under varying divergence regularization, showing HPS-v2 reward [PITH_FULL_IMAGE:figures/full_fig_p007_3.png] view at source ↗

read the original abstract

Autoregressive text-to-image (T2I) generation has recently advanced rapidly, yet aligning generated images with human preferences remains challenging. GRPO-style online reinforcement learning provides an effective framework; however, existing methods typically treat reference-policy divergence as fixed, despite its direct impact on policy optimization. We study this overlooked factor within a unified f-divergence framework, encompassing forward KL, reverse KL, and JS divergence, for GRPO-style autoregressive T2I alignment. Our systematic theoretical analysis reveals that different divergences reshape token-level updates in distinct ways. In particular, under the sampled-token shaping form used, JS regularization achieves a favorable trade-off by mitigating uniform bias relative to the reference policy while still discouraging large deviations. Extensive experiments on LlamaGen and Janus-7B show that JS divergence achieves the strongest or highly competitive optimization performance on most evaluation metrics while maintaining favorable generation diversity. The code is available at https://github.com/tuoyou-hao/BPD-GRPO.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

JS divergence looks practically useful in GRPO for T2I but the token-update theory needs direct checks against actual gradients.

read the letter

The paper's main takeaway is that swapping in Jensen-Shannon divergence inside GRPO for autoregressive text-to-image models gives a workable trade-off between alignment metrics and output diversity on LlamaGen and Janus-7B. They frame the divergence choice as an overlooked knob and run a side-by-side comparison under a sampled-token shaping form.

What is actually new is the targeted application of the existing f-divergence family to this exact setting, plus the claim that JS reduces uniform bias relative to the reference while still regularizing large steps. The experiments are straightforward, they report results across standard metrics, and they release the code, which makes the work easy to inspect or extend.

The soft spot is the theoretical analysis. The abstract and stress-test note both point to the same gap: the derived token-level update directions under the shaping form are presented as explaining the results, but there is no direct evidence shown that those directions match the gradients observed during real GRPO steps on sequential autoregressive sampling with online rewards. If the independence or fixed-reference assumptions in the shaping form do not hold, the explanation for why JS wins could be post-hoc. The experiments themselves appear clean enough on the two models, but without that validation the causal story stays weaker than the performance numbers.

This is for people already working on RL post-training of generative models who want to try a different divergence without building a new framework. A reader running similar GRPO setups would get immediate value from the code and the reported trade-offs.

It is worth sending to peer review. The experiments and code give it enough substance for referees to check the claims, even if the theory section needs tightening.

Referee Report

2 major / 2 minor

Summary. The paper studies the role of reference-policy divergence in GRPO-style online RL for autoregressive T2I models within a unified f-divergence framework (forward KL, reverse KL, JS). It provides a theoretical analysis of how these divergences reshape token-level updates under a sampled-token shaping form, claiming that JS divergence mitigates uniform bias relative to the reference while discouraging large deviations, yielding a favorable performance-diversity trade-off. Experiments on LlamaGen and Janus-7B show JS achieving the strongest or highly competitive results on most metrics while preserving generation diversity. Code is released.

Significance. If the theoretical reshaping analysis holds and directly accounts for the observed metric improvements, the work supplies a principled basis for selecting regularization in GRPO-style T2I alignment, addressing an overlooked factor in balancing optimization strength against diversity. Credit is due for the code release and for testing on two distinct autoregressive backbones. The contribution would be strengthened by explicit linkage between local token-update predictions and training dynamics.

major comments (2)

[Theoretical analysis] Theoretical analysis section: The central claim that the sampled-token shaping form produces distinct, predictable effects for JS versus KL (mitigating uniform bias while limiting deviations) is load-bearing for explaining the experimental trade-off. The analysis appears to rely on assumptions such as token independence and fixed reference behavior; these may not hold under sequential autoregressive sampling and online reward feedback in GRPO. Direct validation is needed that the derived token-update directions match observed gradients during actual training steps on LlamaGen or Janus-7B.
[Experiments] Experiments section: The claim that JS divergence yields the strongest or highly competitive optimization performance on most metrics rests on the theory-to-practice mapping. Without reported ablations or correlation analysis showing that the predicted bias mitigation (rather than other implementation choices) drives the metric gains across both models, alternative explanations for the results cannot be ruled out.

minor comments (2)

[Abstract] Abstract and introduction: Expand the description of the sampled-token shaping form and the precise definition of uniform bias so that readers can follow the theoretical claims without first reading the full derivation.
[Experiments] Evaluation tables: If multiple random seeds were used, report standard deviations or error bars alongside the metric values to allow assessment of whether reported differences are statistically meaningful.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which help clarify the linkage between theory and experiments. We address each major point below and will incorporate revisions to strengthen the manuscript.

read point-by-point responses

Referee: [Theoretical analysis] Theoretical analysis section: The central claim that the sampled-token shaping form produces distinct, predictable effects for JS versus KL (mitigating uniform bias while limiting deviations) is load-bearing for explaining the experimental trade-off. The analysis appears to rely on assumptions such as token independence and fixed reference behavior; these may not hold under sequential autoregressive sampling and online reward feedback in GRPO. Direct validation is needed that the derived token-update directions match observed gradients during actual training steps on LlamaGen or Janus-7B.

Authors: We agree the analysis employs simplifying assumptions (token independence and fixed reference) to obtain closed-form token-update expressions under the sampled-token shaping form. These are standard for isolating divergence effects and yield qualitative predictions that align with the observed performance-diversity trade-offs. We will revise the theoretical section to explicitly list the assumptions and discuss their scope in autoregressive GRPO with online rewards. For validation, we will add a new analysis subsection reporting correlation between predicted update directions and observed per-token gradient statistics from training runs on both backbones. revision: partial
Referee: [Experiments] Experiments section: The claim that JS divergence yields the strongest or highly competitive optimization performance on most metrics rests on the theory-to-practice mapping. Without reported ablations or correlation analysis showing that the predicted bias mitigation (rather than other implementation choices) drives the metric gains across both models, alternative explanations for the results cannot be ruled out.

Authors: The experimental protocol holds all GRPO components fixed while varying only the divergence (forward KL, reverse KL, JS) on both LlamaGen and Janus-7B; this isolates the divergence choice as the source of differences. To further tie results to the predicted bias mitigation, we will add correlation analysis between measured deviation from uniform/reference behavior and metric gains across runs. revision: partial

Circularity Check

0 steps flagged

No circularity: theoretical analysis presented as independent derivation from f-divergence framework

full rationale

The abstract describes a systematic theoretical analysis within a unified f-divergence framework that reveals how divergences reshape token-level updates under sampled-token shaping. No equations, self-citations, or fitted parameters are shown that would make any prediction equivalent to its inputs by construction. The JS divergence trade-off claim is framed as an outcome of the analysis rather than a renaming, self-definition, or load-bearing self-citation. The derivation chain remains self-contained against external benchmarks with no reduction to fitted inputs or prior author work invoked as uniqueness theorem.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are stated. The f-divergence framework is treated as standard background.

pith-pipeline@v0.9.1-grok · 5716 in / 1041 out tokens · 19875 ms · 2026-06-26T14:13:14.860129+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

51 extracted references · 8 linked inside Pith

[1]

Automatic thresholding of gray-level pictures using two-dimensional entropy.Computer vision, graphics, and image processing, 47(1):22–32, 1989

Ahmed S Abutaleb. Automatic thresholding of gray-level pictures using two-dimensional entropy.Computer vision, graphics, and image processing, 47(1):22–32, 1989

1989
[2]

Entropy-aware preference alignment for diffusion- based text-to-image generation

Hannan Bai, Haoyuan Sun, and Yuncheng Du. Entropy-aware preference alignment for diffusion- based text-to-image generation. InChinese Conference on Pattern Recognition and Computer Vision (PRCV), pages 373–387. Springer, 2025

2025
[3]

Root mean square error (rmse) or mean absolute error (mae)?–arguments against avoiding rmse in the literature.Geoscientific model development, 7 (3):1247–1250, 2014

Tianfeng Chai and Roland R Draxler. Root mean square error (rmse) or mean absolute error (mae)?–arguments against avoiding rmse in the literature.Geoscientific model development, 7 (3):1247–1250, 2014

2014
[4]

On information-type measure of difference of probability distributions and indirect observations.Studia Sci

Imre Csiszár. On information-type measure of difference of probability distributions and indirect observations.Studia Sci. Math. Hungar., 2:299–318, 1967

1967
[5]

Dpok: Reinforcement learning for fine-tuning text-to-image diffusion models.Advances in Neural Information Processing Systems, 36:79858–79885, 2023

Ying Fan, Olivia Watkins, Yuqing Du, Hao Liu, Moonkyung Ryu, Craig Boutilier, Pieter Abbeel, Mohammad Ghavamzadeh, Kangwook Lee, and Kimin Lee. Dpok: Reinforcement learning for fine-tuning text-to-image diffusion models.Advances in Neural Information Processing Systems, 36:79858–79885, 2023

2023
[6]

Viss-r1: Self-supervised reinforcement video reasoning

Bo Fang, Yuxin Song, Haoyuan Sun, Xinyao Zhang, Qiangqiang Wu, Wenhao Wu, and Antoni B Chan. Viss-r1: Self-supervised reinforcement video reasoning. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11190–11200, 2026

2026
[7]

Geneval: An object-focused framework for evaluating text-to-image alignment.Advances in Neural Information Processing Systems, 36:52132–52152, 2023

Dhruba Ghosh, Hannaneh Hajishirzi, and Ludwig Schmidt. Geneval: An object-focused framework for evaluating text-to-image alignment.Advances in Neural Information Processing Systems, 36:52132–52152, 2023

2023
[8]

Aligning language models with preferences through f-divergence minimization

Dongyoung Go, Tomasz Korbak, Germán Kruszewski, Jos Rozen, Nahyeon Ryu, and Marc Dymetman. Aligning language models with preferences through f-divergence minimization. arXiv preprint arXiv:2302.08215, 2023

arXiv 2023
[9]

Generative adversarial nets.Advances in neural information processing systems, 27, 2014

Ian J Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets.Advances in neural information processing systems, 27, 2014

2014
[10]

Ella: Equip diffusion models with llm for enhanced semantic alignment.arXiv preprint arXiv:2403.05135, 2024

Xiwei Hu, Rui Wang, Yixiao Fang, Bin Fu, Pei Cheng, and Gang Yu. Ella: Equip diffusion models with llm for enhanced semantic alignment.arXiv preprint arXiv:2403.05135, 2024

Pith/arXiv arXiv 2024
[11]

Scope of validity of psnr in image/video quality assessment.Electronics letters, 44(13):800–801, 2008

Quan Huynh-Thu and Mohammed Ghanbari. Scope of validity of psnr in image/video quality assessment.Electronics letters, 44(13):800–801, 2008

2008
[12]

On information and sufficiency.The annals of mathematical statistics, 22(1):79–86, 1951

Solomon Kullback and Richard A Leibler. On information and sufficiency.The annals of mathematical statistics, 22(1):79–86, 1951

1951
[13]

Genai-bench: Evaluating and improving compositional text-to-visual generation.arXiv preprint arXiv:2406.13743, 2024

Baiqi Li, Zhiqiu Lin, Deepak Pathak, Jiayao Li, Yixin Fei, Kewen Wu, Tiffany Ling, Xide Xia, Pengchuan Zhang, Graham Neubig, et al. Genai-bench: Evaluating and improving compositional text-to-visual generation.arXiv preprint arXiv:2406.13743, 2024

arXiv 2024
[14]

Align- ing diffusion models by optimizing human utility.Advances in Neural Information Processing Systems, 37:24897–24925, 2024

Shufan Li, Konstantinos Kallidromitis, Akash Gokul, Yusuke Kato, and Kazuki Kozuka. Align- ing diffusion models by optimizing human utility.Advances in Neural Information Processing Systems, 37:24897–24925, 2024

2024
[15]

Step-aware preference optimization: Aligning preference with denoising performance at each step.arXiv preprint arXiv:2406.04314, 2(5):7, 2024

Zhanhao Liang, Yuhui Yuan, Shuyang Gu, Bohan Chen, Tiankai Hang, Ji Li, and Liang Zheng. Step-aware preference optimization: Aligning preference with denoising performance at each step.arXiv preprint arXiv:2406.04314, 2(5):7, 2024. 9

arXiv 2024
[16]

Divergence measures based on the shannon entropy.IEEE Transactions on Information theory, 37(1):145–151, 1991

Jianhua Lin. Divergence measures based on the shannon entropy.IEEE Transactions on Information theory, 37(1):145–151, 1991

1991
[17]

Evaluating text-to-visual generation with image-to-text generation

Zhiqiu Lin, Deepak Pathak, Baiqi Li, Jiayao Li, Xide Xia, Graham Neubig, Pengchuan Zhang, and Deva Ramanan. Evaluating text-to-visual generation with image-to-text generation. In European Conference on Computer Vision, pages 366–384. Springer, 2024

2024
[18]

Sample by step, optimize by chunk: Chunk-level grpo for text-to-image generation.arXiv preprint arXiv:2510.21583, 2025

Yifu Luo, Penghui Du, Bo Li, Sinan Du, Tiantian Zhang, Yongzhe Chang, Kai Wu, Kun Gai, and Xueqian Wang. Sample by step, optimize by chunk: Chunk-level grpo for text-to-image generation.arXiv preprint arXiv:2510.21583, 2025

Pith/arXiv arXiv 2025
[19]

Reinforcement learning meets masked generative models: Mask- grpo for text-to-image generation.Advances in Neural Information Processing Systems, 38: 108460–108485, 2026

Yifu Luo, Xinhao Hu, Keyu Fan, Haoyuan Sun, Zeyu Chen, Bo Xia, Tiantian Zhang, Yongzhe Chang, and Xueqian Wang. Reinforcement learning meets masked generative models: Mask- grpo for text-to-image generation.Advances in Neural Information Processing Systems, 38: 108460–108485, 2026

2026
[20]

The hidden link between rlhf and contrastive learning.arXiv preprint arXiv:2506.22578, 2025

Xufei Lv, Kehai Chen, Haoyuan Sun, Xuefeng Bai, Min Zhang, and Houde Liu. The hidden link between rlhf and contrastive learning.arXiv preprint arXiv:2506.22578, 2025

arXiv 2025
[21]

Training language models to follow instructions with human feedback.Advances in neural information processing systems, 35:27730–27744, 2022

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback.Advances in neural information processing systems, 35:27730–27744, 2022

2022
[22]

Learning transferable visual models from natural language supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. InInternational conference on machine learning, pages 8748–8763. PmLR, 2021

2021
[23]

Proximal policy optimization algorithms.arXiv preprint arXiv:1707.06347, 2017

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms.arXiv preprint arXiv:1707.06347, 2017

Pith/arXiv arXiv 2017
[24]

A mathematical theory of communication.The Bell system technical journal, 27(3):379–423, 1948

Claude Elwood Shannon. A mathematical theory of communication.The Bell system technical journal, 27(3):379–423, 1948

1948
[25]

Deepseekmath: Pushing the limits of mathematical reasoning in open language models.arXiv preprint arXiv:2402.03300, 2024

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models.arXiv preprint arXiv:2402.03300, 2024

Pith/arXiv arXiv 2024
[26]

Cologen: Progressive learning of concept-localization duality for unified image generation

Yuxin Song, Yu Lu, Haoyuan Sun, Huanjin Yao, Fanglong Liu, Yifan Sun, Haocheng Feng, Hang Zhou, and Jingdong Wang. Cologen: Progressive learning of concept-localization duality for unified image generation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14724–14734, 2026

2026
[27]

Generalizing offline alignment theoretical paradigm with diverse divergence constraints

Haoyuan Sun, Yuxin Zheng, Yifei Zhao, Yongzhe Chang, and Xueqian Wang. Generalizing offline alignment theoretical paradigm with diverse divergence constraints. InICML 2024 Workshop on Models of Human Feedback for AI Alignment, 2024

2024
[28]

Reinforcement fine-tuning powers reasoning capability of multimodal large language models.arXiv preprint arXiv:2505.18536, 2025

Haoyuan Sun, Jiaqi Wu, Bo Xia, Yifu Luo, Yifei Zhao, Kai Qin, Xufei Lv, Tiantian Zhang, Yongzhe Chang, and Xueqian Wang. Reinforcement fine-tuning powers reasoning capability of multimodal large language models.arXiv preprint arXiv:2505.18536, 2025

arXiv 2025
[29]

Generalizing alignment paradigm of text-to-image generation with preferences through f-divergence minimization

Haoyuan Sun, Bo Xia, Yongzhe Chang, and Xueqian Wang. Generalizing alignment paradigm of text-to-image generation with preferences through f-divergence minimization. InProceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 27644–27652, 2025

2025
[30]

Identical human prefer- ence alignment paradigm for text-to-image models

Haoyuan Sun, Bo Xia, Yifei Zhao, Yongzhe Chang, and Xueqian Wang. Identical human prefer- ence alignment paradigm for text-to-image models. InICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2025

2025
[31]

Positive enhanced preference alignment for text-to-image models

Haoyuan Sun, Bo Xia, Yifei Zhao, Yongzhe Chang, and Xueqian Wang. Positive enhanced preference alignment for text-to-image models. InICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2025. 10

2025
[32]

Power reinforcement post-training of text-to-image models with super-linear advantage shaping.arXiv preprint arXiv:2605.10937, 2026

Haoyuan Sun, Jing Wang, Yuxin Song, Yu Lu, Bo Fang, Yifu Luo, Jun Yin, Pengyu Zeng, Miao Zhang, Tiantian Zhang, et al. Power reinforcement post-training of text-to-image models with super-linear advantage shaping.arXiv preprint arXiv:2605.10937, 2026

Pith/arXiv arXiv 2026
[33]

Calibration enhanced decision maker: Towards trustworthy sequential decision-making with large sequence models

Haoyuan Sun, Bo Xia, Yifu Luo, Tiantian Zhang, and Xueqian Wang. Calibration enhanced decision maker: Towards trustworthy sequential decision-making with large sequence models. Transactions on Machine Learning Research, 2026

2026
[34]

Autoregressive model beats diffusion: Llama for scalable image generation.arXiv preprint arXiv:2406.06525, 2024

Peize Sun, Yi Jiang, Shoufa Chen, Shilong Zhang, Bingyue Peng, Ping Luo, and Zehuan Yuan. Autoregressive model beats diffusion: Llama for scalable image generation.arXiv preprint arXiv:2406.06525, 2024

Pith/arXiv arXiv 2024
[35]

Policy gradient meth- ods for reinforcement learning with function approximation.Advances in neural information processing systems, 12, 1999

Richard S Sutton, David McAllester, Satinder Singh, and Yishay Mansour. Policy gradient meth- ods for reinforcement learning with function approximation.Advances in neural information processing systems, 12, 1999

1999
[36]

Policy gradient methods for reinforcement learning with function approximation and action-dependent baselines.arXiv preprint arXiv:1706.06643, 2017

Philip S Thomas and Emma Brunskill. Policy gradient methods for reinforcement learning with function approximation and action-dependent baselines.arXiv preprint arXiv:1706.06643, 2017

Pith/arXiv arXiv 2017
[37]

Diffusion model alignment using direct preference optimization

Bram Wallace, Meihua Dang, Rafael Rafailov, Linqi Zhou, Aaron Lou, Senthil Purushwalkam, Stefano Ermon, Caiming Xiong, Shafiq Joty, and Nikhil Naik. Diffusion model alignment using direct preference optimization. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8228–8238, 2024

2024
[38]

Beyond reverse kl: Generalizing direct preference optimization with diverse divergence constraints

Chaoqi Wang, Yibo Jiang, Chenghao Yang, Han Liu, and Yuxin Chen. Beyond reverse kl: Generalizing direct preference optimization with diverse divergence constraints. InInternational Conference on Learning Representations, volume 2024, pages 10450–10480, 2024

2024
[39]

Image quality assessment: from error visibility to structural similarity.IEEE transactions on image processing, 13(4): 600–612, 2004

Zhou Wang, Alan C Bovik, Hamid R Sheikh, and Eero P Simoncelli. Image quality assessment: from error visibility to structural similarity.IEEE transactions on image processing, 13(4): 600–612, 2004

2004
[40]

Janus: Decoupling visual encoding for unified multimodal understanding and generation

Chengyue Wu, Xiaokang Chen, Zhiyu Wu, Yiyang Ma, Xingchao Liu, Zizheng Pan, Wen Liu, Zhenda Xie, Xingkai Yu, Chong Ruan, et al. Janus: Decoupling visual encoding for unified multimodal understanding and generation. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 12966–12977, 2025

2025
[41]

Human preference score v2: A solid benchmark for evaluating human preferences of text-to-image synthesis.arXiv preprint arXiv:2306.09341, 2023

Xiaoshi Wu, Yiming Hao, Keqiang Sun, Yixiong Chen, Feng Zhu, Rui Zhao, and Hongsheng Li. Human preference score v2: A solid benchmark for evaluating human preferences of text-to-image synthesis.arXiv preprint arXiv:2306.09341, 2023

Pith/arXiv arXiv 2023
[42]

Human preference score: Better aligning text-to-image models with human preference

Xiaoshi Wu, Keqiang Sun, Feng Zhu, Rui Zhao, and Hongsheng Li. Human preference score: Better aligning text-to-image models with human preference. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 2096–2105, 2023

2096
[43]

A delay-robust method for enhanced real-time reinforcement learning.Neural Networks, 181:106769, 2025

Bo Xia, Haoyuan Sun, Bo Yuan, Zhiheng Li, Bin Liang, and Xueqian Wang. A delay-robust method for enhanced real-time reinforcement learning.Neural Networks, 181:106769, 2025

2025
[44]

Imagereward: Learning and evaluating human preferences for text-to-image generation

Jiazheng Xu, Xiao Liu, Yuchen Wu, Yuxuan Tong, Qinkai Li, Ming Ding, Jie Tang, and Yuxiao Dong. Imagereward: Learning and evaluating human preferences for text-to-image generation. Advances in Neural Information Processing Systems, 36:15903–15935, 2023

2023
[45]

Using human feedback to fine-tune diffusion models without any reward model

Kai Yang, Jian Tao, Jiafei Lyu, Chunjiang Ge, Jiaxin Chen, Weihan Shen, Xiaolong Zhu, and Xiu Li. Using human feedback to fine-tune diffusion models without any reward model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8941–8951, 2024

2024
[46]

Floorplan-llama: Aligning architects’ feedback and domain knowledge in architectural floor plan generation

Jun Yin, Pengyu Zeng, Haoyuan Sun, Yuqin Dai, Han Zheng, Miao Zhang, Yachao Zhang, and Shuai Lu. Floorplan-llama: Aligning architects’ feedback and domain knowledge in architectural floor plan generation. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 6640–6662, 2025. 11

2025
[47]

Ai-empowered prediction of office building energy use from single-view conceptual images for early-stage design.Applied Energy, 406:127289, 2026

Jun Yin, Pengyu Zeng, Yujian Huang, Haoyuan Sun, Tianze Hao, Shuai Lu, et al. Ai-empowered prediction of office building energy use from single-view conceptual images for early-stage design.Applied Energy, 406:127289, 2026

2026
[48]

Mred-14: A benchmark for low-energy residential floor plan generation with 14 flexible inputs

Pengyu Zeng, Jun Yin, Haoyuan Sun, Yuqin Dai, Maowei Jiang, Miao Zhang, and Shuai Lu. Mred-14: A benchmark for low-energy residential floor plan generation with 14 flexible inputs. InProceedings of the 33rd ACM International Conference on Multimedia, pages 11298–11307, 2025

2025
[49]

Group critical-token policy optimization for autoregressive image generation.arXiv preprint arXiv:2509.22485, 2025

Guohui Zhang, Hu Yu, Xiaoxiao Ma, JingHao Zhang, Yaning Pan, Mingde Yao, Jie Xiao, Linjiang Huang, and Feng Zhao. Group critical-token policy optimization for autoregressive image generation.arXiv preprint arXiv:2509.22485, 2025

arXiv 2025
[50]

Fsim: A feature similarity index for image quality assessment.IEEE transactions on Image Processing, 20(8):2378–2386, 2011

Lin Zhang, Lei Zhang, Xuanqin Mou, and David Zhang. Fsim: A feature similarity index for image quality assessment.IEEE transactions on Image Processing, 20(8):2378–2386, 2011

2011
[51]

The unrea- sonable effectiveness of deep features as a perceptual metric

Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unrea- sonable effectiveness of deep features as a perceptual metric. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 586–595, 2018. 12

2018

[1] [1]

Automatic thresholding of gray-level pictures using two-dimensional entropy.Computer vision, graphics, and image processing, 47(1):22–32, 1989

Ahmed S Abutaleb. Automatic thresholding of gray-level pictures using two-dimensional entropy.Computer vision, graphics, and image processing, 47(1):22–32, 1989

1989

[2] [2]

Entropy-aware preference alignment for diffusion- based text-to-image generation

Hannan Bai, Haoyuan Sun, and Yuncheng Du. Entropy-aware preference alignment for diffusion- based text-to-image generation. InChinese Conference on Pattern Recognition and Computer Vision (PRCV), pages 373–387. Springer, 2025

2025

[3] [3]

Root mean square error (rmse) or mean absolute error (mae)?–arguments against avoiding rmse in the literature.Geoscientific model development, 7 (3):1247–1250, 2014

Tianfeng Chai and Roland R Draxler. Root mean square error (rmse) or mean absolute error (mae)?–arguments against avoiding rmse in the literature.Geoscientific model development, 7 (3):1247–1250, 2014

2014

[4] [4]

On information-type measure of difference of probability distributions and indirect observations.Studia Sci

Imre Csiszár. On information-type measure of difference of probability distributions and indirect observations.Studia Sci. Math. Hungar., 2:299–318, 1967

1967

[5] [5]

Dpok: Reinforcement learning for fine-tuning text-to-image diffusion models.Advances in Neural Information Processing Systems, 36:79858–79885, 2023

Ying Fan, Olivia Watkins, Yuqing Du, Hao Liu, Moonkyung Ryu, Craig Boutilier, Pieter Abbeel, Mohammad Ghavamzadeh, Kangwook Lee, and Kimin Lee. Dpok: Reinforcement learning for fine-tuning text-to-image diffusion models.Advances in Neural Information Processing Systems, 36:79858–79885, 2023

2023

[6] [6]

Viss-r1: Self-supervised reinforcement video reasoning

Bo Fang, Yuxin Song, Haoyuan Sun, Xinyao Zhang, Qiangqiang Wu, Wenhao Wu, and Antoni B Chan. Viss-r1: Self-supervised reinforcement video reasoning. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11190–11200, 2026

2026

[7] [7]

Geneval: An object-focused framework for evaluating text-to-image alignment.Advances in Neural Information Processing Systems, 36:52132–52152, 2023

Dhruba Ghosh, Hannaneh Hajishirzi, and Ludwig Schmidt. Geneval: An object-focused framework for evaluating text-to-image alignment.Advances in Neural Information Processing Systems, 36:52132–52152, 2023

2023

[8] [8]

Aligning language models with preferences through f-divergence minimization

Dongyoung Go, Tomasz Korbak, Germán Kruszewski, Jos Rozen, Nahyeon Ryu, and Marc Dymetman. Aligning language models with preferences through f-divergence minimization. arXiv preprint arXiv:2302.08215, 2023

arXiv 2023

[9] [9]

Generative adversarial nets.Advances in neural information processing systems, 27, 2014

Ian J Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets.Advances in neural information processing systems, 27, 2014

2014

[10] [10]

Ella: Equip diffusion models with llm for enhanced semantic alignment.arXiv preprint arXiv:2403.05135, 2024

Xiwei Hu, Rui Wang, Yixiao Fang, Bin Fu, Pei Cheng, and Gang Yu. Ella: Equip diffusion models with llm for enhanced semantic alignment.arXiv preprint arXiv:2403.05135, 2024

Pith/arXiv arXiv 2024

[11] [11]

Scope of validity of psnr in image/video quality assessment.Electronics letters, 44(13):800–801, 2008

Quan Huynh-Thu and Mohammed Ghanbari. Scope of validity of psnr in image/video quality assessment.Electronics letters, 44(13):800–801, 2008

2008

[12] [12]

On information and sufficiency.The annals of mathematical statistics, 22(1):79–86, 1951

Solomon Kullback and Richard A Leibler. On information and sufficiency.The annals of mathematical statistics, 22(1):79–86, 1951

1951

[13] [13]

Genai-bench: Evaluating and improving compositional text-to-visual generation.arXiv preprint arXiv:2406.13743, 2024

Baiqi Li, Zhiqiu Lin, Deepak Pathak, Jiayao Li, Yixin Fei, Kewen Wu, Tiffany Ling, Xide Xia, Pengchuan Zhang, Graham Neubig, et al. Genai-bench: Evaluating and improving compositional text-to-visual generation.arXiv preprint arXiv:2406.13743, 2024

arXiv 2024

[14] [14]

Align- ing diffusion models by optimizing human utility.Advances in Neural Information Processing Systems, 37:24897–24925, 2024

Shufan Li, Konstantinos Kallidromitis, Akash Gokul, Yusuke Kato, and Kazuki Kozuka. Align- ing diffusion models by optimizing human utility.Advances in Neural Information Processing Systems, 37:24897–24925, 2024

2024

[15] [15]

Step-aware preference optimization: Aligning preference with denoising performance at each step.arXiv preprint arXiv:2406.04314, 2(5):7, 2024

Zhanhao Liang, Yuhui Yuan, Shuyang Gu, Bohan Chen, Tiankai Hang, Ji Li, and Liang Zheng. Step-aware preference optimization: Aligning preference with denoising performance at each step.arXiv preprint arXiv:2406.04314, 2(5):7, 2024. 9

arXiv 2024

[16] [16]

Divergence measures based on the shannon entropy.IEEE Transactions on Information theory, 37(1):145–151, 1991

Jianhua Lin. Divergence measures based on the shannon entropy.IEEE Transactions on Information theory, 37(1):145–151, 1991

1991

[17] [17]

Evaluating text-to-visual generation with image-to-text generation

Zhiqiu Lin, Deepak Pathak, Baiqi Li, Jiayao Li, Xide Xia, Graham Neubig, Pengchuan Zhang, and Deva Ramanan. Evaluating text-to-visual generation with image-to-text generation. In European Conference on Computer Vision, pages 366–384. Springer, 2024

2024

[18] [18]

Sample by step, optimize by chunk: Chunk-level grpo for text-to-image generation.arXiv preprint arXiv:2510.21583, 2025

Yifu Luo, Penghui Du, Bo Li, Sinan Du, Tiantian Zhang, Yongzhe Chang, Kai Wu, Kun Gai, and Xueqian Wang. Sample by step, optimize by chunk: Chunk-level grpo for text-to-image generation.arXiv preprint arXiv:2510.21583, 2025

Pith/arXiv arXiv 2025

[19] [19]

Reinforcement learning meets masked generative models: Mask- grpo for text-to-image generation.Advances in Neural Information Processing Systems, 38: 108460–108485, 2026

Yifu Luo, Xinhao Hu, Keyu Fan, Haoyuan Sun, Zeyu Chen, Bo Xia, Tiantian Zhang, Yongzhe Chang, and Xueqian Wang. Reinforcement learning meets masked generative models: Mask- grpo for text-to-image generation.Advances in Neural Information Processing Systems, 38: 108460–108485, 2026

2026

[20] [20]

The hidden link between rlhf and contrastive learning.arXiv preprint arXiv:2506.22578, 2025

Xufei Lv, Kehai Chen, Haoyuan Sun, Xuefeng Bai, Min Zhang, and Houde Liu. The hidden link between rlhf and contrastive learning.arXiv preprint arXiv:2506.22578, 2025

arXiv 2025

[21] [21]

Training language models to follow instructions with human feedback.Advances in neural information processing systems, 35:27730–27744, 2022

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback.Advances in neural information processing systems, 35:27730–27744, 2022

2022

[22] [22]

Learning transferable visual models from natural language supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. InInternational conference on machine learning, pages 8748–8763. PmLR, 2021

2021

[23] [23]

Proximal policy optimization algorithms.arXiv preprint arXiv:1707.06347, 2017

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms.arXiv preprint arXiv:1707.06347, 2017

Pith/arXiv arXiv 2017

[24] [24]

A mathematical theory of communication.The Bell system technical journal, 27(3):379–423, 1948

Claude Elwood Shannon. A mathematical theory of communication.The Bell system technical journal, 27(3):379–423, 1948

1948

[25] [25]

Deepseekmath: Pushing the limits of mathematical reasoning in open language models.arXiv preprint arXiv:2402.03300, 2024

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models.arXiv preprint arXiv:2402.03300, 2024

Pith/arXiv arXiv 2024

[26] [26]

Cologen: Progressive learning of concept-localization duality for unified image generation

Yuxin Song, Yu Lu, Haoyuan Sun, Huanjin Yao, Fanglong Liu, Yifan Sun, Haocheng Feng, Hang Zhou, and Jingdong Wang. Cologen: Progressive learning of concept-localization duality for unified image generation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14724–14734, 2026

2026

[27] [27]

Generalizing offline alignment theoretical paradigm with diverse divergence constraints

Haoyuan Sun, Yuxin Zheng, Yifei Zhao, Yongzhe Chang, and Xueqian Wang. Generalizing offline alignment theoretical paradigm with diverse divergence constraints. InICML 2024 Workshop on Models of Human Feedback for AI Alignment, 2024

2024

[28] [28]

Reinforcement fine-tuning powers reasoning capability of multimodal large language models.arXiv preprint arXiv:2505.18536, 2025

Haoyuan Sun, Jiaqi Wu, Bo Xia, Yifu Luo, Yifei Zhao, Kai Qin, Xufei Lv, Tiantian Zhang, Yongzhe Chang, and Xueqian Wang. Reinforcement fine-tuning powers reasoning capability of multimodal large language models.arXiv preprint arXiv:2505.18536, 2025

arXiv 2025

[29] [29]

Generalizing alignment paradigm of text-to-image generation with preferences through f-divergence minimization

Haoyuan Sun, Bo Xia, Yongzhe Chang, and Xueqian Wang. Generalizing alignment paradigm of text-to-image generation with preferences through f-divergence minimization. InProceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 27644–27652, 2025

2025

[30] [30]

Identical human prefer- ence alignment paradigm for text-to-image models

Haoyuan Sun, Bo Xia, Yifei Zhao, Yongzhe Chang, and Xueqian Wang. Identical human prefer- ence alignment paradigm for text-to-image models. InICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2025

2025

[31] [31]

Positive enhanced preference alignment for text-to-image models

Haoyuan Sun, Bo Xia, Yifei Zhao, Yongzhe Chang, and Xueqian Wang. Positive enhanced preference alignment for text-to-image models. InICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2025. 10

2025

[32] [32]

Power reinforcement post-training of text-to-image models with super-linear advantage shaping.arXiv preprint arXiv:2605.10937, 2026

Haoyuan Sun, Jing Wang, Yuxin Song, Yu Lu, Bo Fang, Yifu Luo, Jun Yin, Pengyu Zeng, Miao Zhang, Tiantian Zhang, et al. Power reinforcement post-training of text-to-image models with super-linear advantage shaping.arXiv preprint arXiv:2605.10937, 2026

Pith/arXiv arXiv 2026

[33] [33]

Calibration enhanced decision maker: Towards trustworthy sequential decision-making with large sequence models

Haoyuan Sun, Bo Xia, Yifu Luo, Tiantian Zhang, and Xueqian Wang. Calibration enhanced decision maker: Towards trustworthy sequential decision-making with large sequence models. Transactions on Machine Learning Research, 2026

2026

[34] [34]

Autoregressive model beats diffusion: Llama for scalable image generation.arXiv preprint arXiv:2406.06525, 2024

Peize Sun, Yi Jiang, Shoufa Chen, Shilong Zhang, Bingyue Peng, Ping Luo, and Zehuan Yuan. Autoregressive model beats diffusion: Llama for scalable image generation.arXiv preprint arXiv:2406.06525, 2024

Pith/arXiv arXiv 2024

[35] [35]

Policy gradient meth- ods for reinforcement learning with function approximation.Advances in neural information processing systems, 12, 1999

Richard S Sutton, David McAllester, Satinder Singh, and Yishay Mansour. Policy gradient meth- ods for reinforcement learning with function approximation.Advances in neural information processing systems, 12, 1999

1999

[36] [36]

Policy gradient methods for reinforcement learning with function approximation and action-dependent baselines.arXiv preprint arXiv:1706.06643, 2017

Philip S Thomas and Emma Brunskill. Policy gradient methods for reinforcement learning with function approximation and action-dependent baselines.arXiv preprint arXiv:1706.06643, 2017

Pith/arXiv arXiv 2017

[37] [37]

Diffusion model alignment using direct preference optimization

Bram Wallace, Meihua Dang, Rafael Rafailov, Linqi Zhou, Aaron Lou, Senthil Purushwalkam, Stefano Ermon, Caiming Xiong, Shafiq Joty, and Nikhil Naik. Diffusion model alignment using direct preference optimization. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8228–8238, 2024

2024

[38] [38]

Beyond reverse kl: Generalizing direct preference optimization with diverse divergence constraints

Chaoqi Wang, Yibo Jiang, Chenghao Yang, Han Liu, and Yuxin Chen. Beyond reverse kl: Generalizing direct preference optimization with diverse divergence constraints. InInternational Conference on Learning Representations, volume 2024, pages 10450–10480, 2024

2024

[39] [39]

Image quality assessment: from error visibility to structural similarity.IEEE transactions on image processing, 13(4): 600–612, 2004

Zhou Wang, Alan C Bovik, Hamid R Sheikh, and Eero P Simoncelli. Image quality assessment: from error visibility to structural similarity.IEEE transactions on image processing, 13(4): 600–612, 2004

2004

[40] [40]

Janus: Decoupling visual encoding for unified multimodal understanding and generation

Chengyue Wu, Xiaokang Chen, Zhiyu Wu, Yiyang Ma, Xingchao Liu, Zizheng Pan, Wen Liu, Zhenda Xie, Xingkai Yu, Chong Ruan, et al. Janus: Decoupling visual encoding for unified multimodal understanding and generation. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 12966–12977, 2025

2025

[41] [41]

Human preference score v2: A solid benchmark for evaluating human preferences of text-to-image synthesis.arXiv preprint arXiv:2306.09341, 2023

Xiaoshi Wu, Yiming Hao, Keqiang Sun, Yixiong Chen, Feng Zhu, Rui Zhao, and Hongsheng Li. Human preference score v2: A solid benchmark for evaluating human preferences of text-to-image synthesis.arXiv preprint arXiv:2306.09341, 2023

Pith/arXiv arXiv 2023

[42] [42]

Human preference score: Better aligning text-to-image models with human preference

Xiaoshi Wu, Keqiang Sun, Feng Zhu, Rui Zhao, and Hongsheng Li. Human preference score: Better aligning text-to-image models with human preference. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 2096–2105, 2023

2096

[43] [43]

A delay-robust method for enhanced real-time reinforcement learning.Neural Networks, 181:106769, 2025

Bo Xia, Haoyuan Sun, Bo Yuan, Zhiheng Li, Bin Liang, and Xueqian Wang. A delay-robust method for enhanced real-time reinforcement learning.Neural Networks, 181:106769, 2025

2025

[44] [44]

Imagereward: Learning and evaluating human preferences for text-to-image generation

Jiazheng Xu, Xiao Liu, Yuchen Wu, Yuxuan Tong, Qinkai Li, Ming Ding, Jie Tang, and Yuxiao Dong. Imagereward: Learning and evaluating human preferences for text-to-image generation. Advances in Neural Information Processing Systems, 36:15903–15935, 2023

2023

[45] [45]

Using human feedback to fine-tune diffusion models without any reward model

Kai Yang, Jian Tao, Jiafei Lyu, Chunjiang Ge, Jiaxin Chen, Weihan Shen, Xiaolong Zhu, and Xiu Li. Using human feedback to fine-tune diffusion models without any reward model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8941–8951, 2024

2024

[46] [46]

Floorplan-llama: Aligning architects’ feedback and domain knowledge in architectural floor plan generation

Jun Yin, Pengyu Zeng, Haoyuan Sun, Yuqin Dai, Han Zheng, Miao Zhang, Yachao Zhang, and Shuai Lu. Floorplan-llama: Aligning architects’ feedback and domain knowledge in architectural floor plan generation. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 6640–6662, 2025. 11

2025

[47] [47]

Ai-empowered prediction of office building energy use from single-view conceptual images for early-stage design.Applied Energy, 406:127289, 2026

Jun Yin, Pengyu Zeng, Yujian Huang, Haoyuan Sun, Tianze Hao, Shuai Lu, et al. Ai-empowered prediction of office building energy use from single-view conceptual images for early-stage design.Applied Energy, 406:127289, 2026

2026

[48] [48]

Mred-14: A benchmark for low-energy residential floor plan generation with 14 flexible inputs

Pengyu Zeng, Jun Yin, Haoyuan Sun, Yuqin Dai, Maowei Jiang, Miao Zhang, and Shuai Lu. Mred-14: A benchmark for low-energy residential floor plan generation with 14 flexible inputs. InProceedings of the 33rd ACM International Conference on Multimedia, pages 11298–11307, 2025

2025

[49] [49]

Group critical-token policy optimization for autoregressive image generation.arXiv preprint arXiv:2509.22485, 2025

Guohui Zhang, Hu Yu, Xiaoxiao Ma, JingHao Zhang, Yaning Pan, Mingde Yao, Jie Xiao, Linjiang Huang, and Feng Zhao. Group critical-token policy optimization for autoregressive image generation.arXiv preprint arXiv:2509.22485, 2025

arXiv 2025

[50] [50]

Fsim: A feature similarity index for image quality assessment.IEEE transactions on Image Processing, 20(8):2378–2386, 2011

Lin Zhang, Lei Zhang, Xuanqin Mou, and David Zhang. Fsim: A feature similarity index for image quality assessment.IEEE transactions on Image Processing, 20(8):2378–2386, 2011

2011

[51] [51]

The unrea- sonable effectiveness of deep features as a perceptual metric

Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unrea- sonable effectiveness of deep features as a perceptual metric. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 586–595, 2018. 12

2018