arxiv: 2606.21498 · v1 · pith:I4HYJILAnew · submitted 2026-06-19 · 💻 cs.AI · cs.LG

Balancing Performance and Diversity in GRPO Autoregressive Text-to-Image Post-Training

Pith reviewed 2026-06-26 14:13 UTC · model grok-4.3

classification 💻 cs.AI cs.LG
keywords GRPO · autoregressive text-to-image · f-divergence · JS divergence · policy optimization · reinforcement learning alignment · generation diversity

The pith

JS divergence achieves the best trade-off between optimization performance and generation diversity in GRPO-style autoregressive text-to-image alignment.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper studies how the choice of reference-policy divergence affects token-level updates in GRPO optimization of autoregressive text-to-image models. It places forward KL, reverse KL, and JS divergence inside a single f-divergence framework and argues that each divergence reshapes the sampled-token updates in a distinct way. Under the shaping form actually used in training, JS regularization mitigates uniform bias relative to the reference policy while still discouraging large policy shifts. Experiments on LlamaGen and Janus-7B show that JS yields the strongest or highly competitive scores on most evaluation metrics while maintaining favorable output diversity.

Core claim

Under the sampled-token shaping form used, JS regularization achieves a favorable trade-off by mitigating uniform bias relative to the reference policy while still discouraging large deviations. Extensive experiments on LlamaGen and Janus-7B show that JS divergence achieves the strongest or highly competitive optimization performance on most evaluation metrics while maintaining favorable generation diversity.

What carries the argument

The unified f-divergence framework (forward KL, reverse KL, JS) applied to token-level updates inside the sampled-token shaping form of GRPO.
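
For orientation, the standard f-divergence definitions behind the three regularizers can be written out as below. This is a sketch of the textbook form, not the paper's own notation: the symbols \pi_\theta, \pi_{\mathrm{ref}}, and u, and the assignment of "forward" and "reverse" to each direction, follow a common convention in the alignment literature and are assumptions here.

```latex
\[
D_f(\pi_\theta \,\|\, \pi_{\mathrm{ref}})
  = \mathbb{E}_{x \sim \pi_{\mathrm{ref}}}\!\left[ f\big(u(x)\big) \right],
\qquad
u(x) = \frac{\pi_\theta(x)}{\pi_{\mathrm{ref}}(x)},
\qquad
f \text{ convex},\; f(1) = 0 .
\]
\[
\begin{aligned}
\text{reverse KL:}\quad & f(u) = u \log u
  && \Rightarrow\; \mathrm{KL}(\pi_\theta \,\|\, \pi_{\mathrm{ref}}) \\
\text{forward KL:}\quad & f(u) = -\log u
  && \Rightarrow\; \mathrm{KL}(\pi_{\mathrm{ref}} \,\|\, \pi_\theta) \\
\text{JS:}\quad & f(u) = \tfrac{1}{2}\Big[\, u \log u - (u+1)\log\tfrac{u+1}{2} \,\Big]
  && \Rightarrow\; \mathrm{JS}(\pi_\theta, \pi_{\mathrm{ref}})
\end{aligned}
\]
```

Of the three, only JS is symmetric in its arguments and bounded (by \log 2 with natural logarithms); both KL directions are unbounded.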

If this is right

  • JS divergence yields strongest or near-strongest scores on standard T2I alignment metrics.
  • JS divergence maintains favorable generation diversity under the same shaping form.
  • Different f-divergences produce measurably distinct token-level update distributions during GRPO.
  • The choice of divergence is a controllable lever that directly trades off bias toward uniformity against policy deviation.
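
To make the "lever" in the last bullet concrete, the sketch below evaluates the three divergences on a single toy next-token distribution. The distributions `pi_ref` and `pi_theta`, the function names, and the direction convention (reverse KL as KL(pi_theta || pi_ref)) are illustrative assumptions, not values or code from the paper.

```python
import math

def f_divergence(p, q, f):
    """D_f(p || q) = sum_x q(x) * f(p(x) / q(x)), for strictly positive p and q."""
    return sum(qx * f(px / qx) for px, qx in zip(p, q))

# Generator functions, with u = pi_theta(x) / pi_ref(x).
def f_reverse_kl(u):
    return u * math.log(u)            # yields KL(pi_theta || pi_ref)

def f_forward_kl(u):
    return -math.log(u)               # yields KL(pi_ref || pi_theta)

def f_js(u):
    return 0.5 * (u * math.log(u) - (u + 1.0) * math.log((u + 1.0) / 2.0))

# Toy example (made up): a policy that has drifted toward one token.
pi_ref = [0.70, 0.20, 0.10]
pi_theta = [0.98, 0.01, 0.01]

rkl = f_divergence(pi_theta, pi_ref, f_reverse_kl)
fkl = f_divergence(pi_theta, pi_ref, f_forward_kl)
js = f_divergence(pi_theta, pi_ref, f_js)

print(f"reverse KL = {rkl:.4f}, forward KL = {fkl:.4f}, JS = {js:.4f}")
```

The same drift is penalized differently by each divergence, and the JS value can never exceed log 2. That is all this toy case shows; it says nothing about which choice trains better.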

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same divergence analysis could be applied to other autoregressive generation tasks that already use GRPO-style updates.
  • If the shaping form changes, the relative advantage of JS may shift, suggesting a need to re-derive the token-level bias for each new shaping rule.
  • The framework supplies a practical recipe for selecting the divergence term when both performance and diversity must be controlled simultaneously.

Load-bearing premise

The unified f-divergence framework and sampled-token shaping form correctly describe how token-level updates actually occur during GRPO training of autoregressive T2I models.
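
One way to see what a sampled-token form can look like: if tokens are drawn from the current policy, any f-divergence can be rewritten as an expectation of g(u) = f(u) / u over those samples, where u is the policy-to-reference probability ratio at the sampled token. The sketch below checks this identity exactly on a small vocabulary. It is a generic importance-sampling identity and may not match the shaping form the paper actually uses; all names and the toy distributions are hypothetical.

```python
import math

# Per-sample terms g(u) = f(u) / u, with u = pi_theta(x) / pi_ref(x).
def g_reverse_kl(u):
    return math.log(u)

def g_forward_kl(u):
    return -math.log(u) / u

def g_js(u):
    return 0.5 * (math.log(u) - (1.0 + 1.0 / u) * math.log((u + 1.0) / 2.0))

def expected_under_policy(pi_theta, pi_ref, g):
    """Exact E_{x ~ pi_theta}[ g(pi_theta(x) / pi_ref(x)) ] over a small vocabulary."""
    return sum(p * g(p / q) for p, q in zip(pi_theta, pi_ref))

def kl(p, q):
    """KL(p || q) computed directly from its definition."""
    return sum(px * math.log(px / qx) for px, qx in zip(p, q))

def js(p, q):
    """Jensen-Shannon divergence computed directly from its definition."""
    m = [(px + qx) / 2.0 for px, qx in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Toy distributions (made up for illustration).
pi_ref = [0.5, 0.3, 0.2]
pi_theta = [0.6, 0.3, 0.1]

est_rkl = expected_under_policy(pi_theta, pi_ref, g_reverse_kl)
est_fkl = expected_under_policy(pi_theta, pi_ref, g_forward_kl)
est_js = expected_under_policy(pi_theta, pi_ref, g_js)
```

In this form the per-sample terms have different shapes: the reverse-KL term log(u) grows without bound as u increases, whereas the JS term approaches (log 2) / 2. Whether this corresponds to the paper's notion of "uniform bias" cannot be determined from the abstract.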

What would settle it

A controlled run on the same models and data where JS divergence fails to match or exceed the other divergences on the majority of metrics or shows clearly lower diversity would falsify the central claim.
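
The criterion above can be made mechanical. The following is a minimal sketch of such a tally, assuming per-metric scores are available for each divergence; the function name, the `tol` parameter, the metric names, and every number in `toy_scores` are placeholders invented for illustration, not results from the paper.

```python
def js_matches_on_majority(scores, higher_is_better, tol=0.0):
    """Count the metrics on which JS matches or beats the best alternative.

    scores: {metric_name: {"FKL": float, "RKL": float, "JS": float}}
    higher_is_better: {metric_name: bool}
    tol: slack within which JS still counts as "matching".
    Returns (number_of_metrics_won, won_a_strict_majority).
    """
    wins = 0
    for metric, by_divergence in scores.items():
        # Flip the sign for lower-is-better metrics so that larger is always better.
        sign = 1.0 if higher_is_better[metric] else -1.0
        best_other = max(
            sign * value for name, value in by_divergence.items() if name != "JS"
        )
        if sign * by_divergence["JS"] >= best_other - tol:
            wins += 1
    return wins, wins > len(scores) / 2

# Placeholder numbers only, to show the call shape.
toy_scores = {
    "metric_a": {"FKL": 0.50, "RKL": 0.52, "JS": 0.55},
    "metric_b": {"FKL": 0.30, "RKL": 0.31, "JS": 0.29},
    "metric_c": {"FKL": 12.0, "RKL": 11.0, "JS": 10.5},
}
toy_direction = {"metric_a": True, "metric_b": True, "metric_c": False}

result = js_matches_on_majority(toy_scores, toy_direction)
```

A diversity measure would enter as one more row in `scores`. A real test would also need the tolerance to reflect seed-to-seed variance.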

Figures

Figures reproduced from arXiv: 2606.21498 by Chunru Yang, Hongbo Duan, Jiahua Pei, Xueqian Wang, Yi Liu, Yuanhao Chiang.

Figure 1: Qualitative comparison of Janus-7B variants trained with RKL divergence, FKL divergence …
Figure 2: Qualitative comparison of LlamaGen variants trained with RKL divergence, FKL divergence, …
Figure 3: Training reward dynamics under varying divergence regularization, showing HPS-v2 reward …

Original abstract

Autoregressive text-to-image (T2I) generation has recently advanced rapidly, yet aligning generated images with human preferences remains challenging. GRPO-style online reinforcement learning provides an effective framework; however, existing methods typically treat reference-policy divergence as fixed, despite its direct impact on policy optimization. We study this overlooked factor within a unified f-divergence framework, encompassing forward KL, reverse KL, and JS divergence, for GRPO-style autoregressive T2I alignment. Our systematic theoretical analysis reveals that different divergences reshape token-level updates in distinct ways. In particular, under the sampled-token shaping form used, JS regularization achieves a favorable trade-off by mitigating uniform bias relative to the reference policy while still discouraging large deviations. Extensive experiments on LlamaGen and Janus-7B show that JS divergence achieves the strongest or highly competitive optimization performance on most evaluation metrics while maintaining favorable generation diversity. The code is available at https://github.com/tuoyou-hao/BPD-GRPO.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper studies the role of reference-policy divergence in GRPO-style online RL for autoregressive T2I models within a unified f-divergence framework (forward KL, reverse KL, JS). It provides a theoretical analysis of how these divergences reshape token-level updates under a sampled-token shaping form, claiming that JS divergence mitigates uniform bias relative to the reference while discouraging large deviations, yielding a favorable performance-diversity trade-off. Experiments on LlamaGen and Janus-7B show JS achieving the strongest or highly competitive results on most metrics while preserving generation diversity. Code is released.

Significance. If the theoretical reshaping analysis holds and directly accounts for the observed metric improvements, the work supplies a principled basis for selecting regularization in GRPO-style T2I alignment, addressing an overlooked factor in balancing optimization strength against diversity. Credit is due for the code release and for testing on two distinct autoregressive backbones. The contribution would be strengthened by explicit linkage between local token-update predictions and training dynamics.

major comments (2)
  1. [Theoretical analysis] Theoretical analysis section: The central claim that the sampled-token shaping form produces distinct, predictable effects for JS versus KL (mitigating uniform bias while limiting deviations) is load-bearing for explaining the experimental trade-off. The analysis appears to rely on assumptions such as token independence and fixed reference behavior; these may not hold under sequential autoregressive sampling and online reward feedback in GRPO. Direct validation is needed that the derived token-update directions match observed gradients during actual training steps on LlamaGen or Janus-7B.
  2. [Experiments] Experiments section: The claim that JS divergence yields the strongest or highly competitive optimization performance on most metrics rests on the theory-to-practice mapping. Without reported ablations or correlation analysis showing that the predicted bias mitigation (rather than other implementation choices) drives the metric gains across both models, alternative explanations for the results cannot be ruled out.
minor comments (2)
  1. [Abstract] Abstract and introduction: Expand the description of the sampled-token shaping form and the precise definition of uniform bias so that readers can follow the theoretical claims without first reading the full derivation.
  2. [Experiments] Evaluation tables: If multiple random seeds were used, report standard deviations or error bars alongside the metric values to allow assessment of whether reported differences are statistically meaningful.
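
The second minor comment amounts to a small change in how table cells are produced. A minimal sketch, assuming each cell has one score per random seed; the helper names and the cell format are hypothetical.

```python
import statistics

def summarize(seed_scores):
    """Return (mean, sample standard deviation) of one metric across seeds."""
    mean = statistics.mean(seed_scores)
    # The sample standard deviation needs at least two seeds; report 0.0 otherwise.
    std = statistics.stdev(seed_scores) if len(seed_scores) > 1 else 0.0
    return mean, std

def format_cell(seed_scores, digits=3):
    """Render a table cell as 'mean ± std (n=seeds)'."""
    mean, std = summarize(seed_scores)
    return f"{mean:.{digits}f} ± {std:.{digits}f} (n={len(seed_scores)})"
```

Reporting `n` alongside the spread makes clear when a difference rests on a single run.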

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which help clarify the linkage between theory and experiments. We address each major point below and will incorporate revisions to strengthen the manuscript.

Point-by-point responses
  1. Referee: [Theoretical analysis] Theoretical analysis section: The central claim that the sampled-token shaping form produces distinct, predictable effects for JS versus KL (mitigating uniform bias while limiting deviations) is load-bearing for explaining the experimental trade-off. The analysis appears to rely on assumptions such as token independence and fixed reference behavior; these may not hold under sequential autoregressive sampling and online reward feedback in GRPO. Direct validation is needed that the derived token-update directions match observed gradients during actual training steps on LlamaGen or Janus-7B.

    Authors: We agree the analysis employs simplifying assumptions (token independence and fixed reference) to obtain closed-form token-update expressions under the sampled-token shaping form. These are standard for isolating divergence effects and yield qualitative predictions that align with the observed performance-diversity trade-offs. We will revise the theoretical section to explicitly list the assumptions and discuss their scope in autoregressive GRPO with online rewards. For validation, we will add a new analysis subsection reporting correlation between predicted update directions and observed per-token gradient statistics from training runs on both backbones.
    Revision: partial

  2. Referee: [Experiments] Experiments section: The claim that JS divergence yields the strongest or highly competitive optimization performance on most metrics rests on the theory-to-practice mapping. Without reported ablations or correlation analysis showing that the predicted bias mitigation (rather than other implementation choices) drives the metric gains across both models, alternative explanations for the results cannot be ruled out.

    Authors: The experimental protocol holds all GRPO components fixed while varying only the divergence (forward KL, reverse KL, JS) on both LlamaGen and Janus-7B; this isolates the divergence choice as the source of differences. To further tie results to the predicted bias mitigation, we will add correlation analysis between measured deviation from uniform/reference behavior and metric gains across runs.
    Revision: partial
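
Both responses promise a correlation analysis. One simple form it could take is a Pearson correlation between a predicted quantity and a measured one, sketched below in plain Python. The choice of Pearson correlation and the argument names are assumptions; the rebuttal does not specify the statistic.

```python
import math

def pearson(predicted, observed):
    """Pearson correlation between two equal-length sequences of floats."""
    n = len(predicted)
    if n != len(observed) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_p = sum(predicted) / n
    mean_o = sum(observed) / n
    cov = sum((p - mean_p) * (o - mean_o) for p, o in zip(predicted, observed))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_o = sum((o - mean_o) ** 2 for o in observed)
    if var_p == 0.0 or var_o == 0.0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_p * var_o)
```

Here `predicted` might hold per-token update magnitudes derived from the theory and `observed` the matching gradient statistics logged during training. A value near 1 would support the theory-to-practice mapping, but it would not by itself rule out confounders.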

Circularity Check

0 steps flagged

No circularity: theoretical analysis presented as independent derivation from f-divergence framework

full rationale

The abstract describes a systematic theoretical analysis within a unified f-divergence framework that reveals how divergences reshape token-level updates under sampled-token shaping. No equations, self-citations, or fitted parameters are shown that would make any prediction equivalent to its inputs by construction. The JS divergence trade-off claim is framed as an outcome of the analysis rather than a renaming, self-definition, or load-bearing self-citation. On the abstract alone, the derivation chain appears self-contained: nothing reduces to fitted inputs, and no prior work by the authors is invoked as a uniqueness theorem.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are stated. The f-divergence framework is treated as standard background.
