
arxiv: 2605.17602 · v2 · pith:CHJMJXHWnew · submitted 2026-05-17 · 💻 cs.AI · cs.CV · cs.LG

AutoRubric-T2I: Robust Rule-Based Reward Model for Text-to-Image Alignment

Pith reviewed 2026-05-22 09:16 UTC · model grok-4.3

classification 💻 cs.AI · cs.CV · cs.LG
keywords text-to-image generation, reward model, vision-language model, rubric learning, preference alignment, interpretable evaluation, reinforcement learning, data-efficient training

The pith

AutoRubric-T2I turns a very small amount of preference data into a compact set of explicit rubrics that let VLMs judge text-to-image alignment more accurately than trained reward models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents AutoRubric-T2I as a framework that automatically creates and selects clear scoring rules for vision-language models to evaluate generated images against their prompts. It starts by turning human preference pairs into candidate rubrics through synthesized reasoning traces, then has the VLM assign scores to image pairs under each rule. An L1-regularized logistic regression step then keeps only the most useful rubrics, removing noise and redundancy. This approach produces reliable reward signals while using less than 0.01 percent of the usual annotated preference data, and it outperforms standard reward-model baselines on benchmarks like MMRB2. The same rubrics also serve as rewards in reinforcement learning pipelines to improve final image quality on downstream tasks.

Core claim

AutoRubric-T2I first synthesizes reasoning traces from preference pairs into candidate rubrics, scores paired images under each rubric with a VLM to obtain pairwise differences, and applies an L1-regularized logistic regression refiner to select the top-N most discriminative rubrics. The resulting compact, interpretable rule set yields high-quality reward signals from under 0.01 percent of typical annotated data, outperforms strong reward-model baselines on MMRB2, and improves generation quality when used as an RL reward in pipelines such as Flow-GRPO on diffusion models for tasks including TIIF and UniGenBench++.
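One way to write the selection step down is sketched here. The notation is assumed for illustration and is not taken from the paper: $s_k(x, I)$ is the VLM's score for image $I$ on prompt $x$ under rubric $k$, $(I_i^{+}, I_i^{-})$ are the preferred and rejected images of pair $i$, $\lambda$ is the $\ell_1$ strength, and $S_N$ is the set of $N$ rubrics with the largest $|\hat{w}_k|$. Ranking by weight magnitude is itself an assumption about how "Top-$N$" is chosen.

```latex
\begin{aligned}
\Delta_{ik} &= s_k(x_i, I_i^{+}) - s_k(x_i, I_i^{-}), \\
\hat{w} &= \arg\min_{w \in \mathbb{R}^{K}} \; \frac{1}{n} \sum_{i=1}^{n}
           \log\!\bigl(1 + \exp(-w^{\top} \Delta_i)\bigr) + \lambda \lVert w \rVert_1, \\
r(x, I) &= \sum_{k \in S_N} \hat{w}_k \, s_k(x, I).
\end{aligned}
```

Because each row $\Delta_i$ is oriented as preferred minus rejected, every pair carries the same implicit label, and the logistic loss reduces to a penalty on pairs whose weighted rubric gap $w^{\top}\Delta_i$ is small or negative.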

What carries the argument

The AutoRubric-T2I pipeline that converts preference pairs into candidate rubrics via synthesized reasoning traces and refines them with L1-regularized logistic regression to produce a small set of explicit, discriminative rules for VLM-based scoring.
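The refinement step can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the proximal-gradient solver, the function names, the toy data generator, and the values of `lam`, `lr`, and `steps` are all assumptions, and the rubric score gaps are simulated rather than produced by a VLM.

```python
import math
import random


def soft_threshold(v, t):
    """Proximal operator of t * |v|: shrink v toward zero by t."""
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0


def fit_l1_logistic(diffs, lam, lr=0.5, steps=300):
    """Minimise mean log(1 + exp(-w . d)) + lam * ||w||_1 by proximal gradient.

    diffs[i][k] is the (preferred - rejected) score gap of pair i under rubric k,
    so every row carries the implicit label "preferred wins".
    """
    n, k = len(diffs), len(diffs[0])
    w = [0.0] * k
    for _ in range(steps):
        grad = [0.0] * k
        for d in diffs:
            margin = sum(wj * dj for wj, dj in zip(w, d))
            # d/dw log(1 + exp(-margin)) = -sigmoid(-margin) * d (margin clamped for safety)
            coef = -1.0 / (1.0 + math.exp(min(margin, 50.0)))
            for j in range(k):
                grad[j] += coef * d[j] / n
        # Gradient step on the smooth loss, then shrinkage for the L1 term.
        w = [soft_threshold(wj - lr * gj, lr * lam) for wj, gj in zip(w, grad)]
    return w


def top_n(w, n):
    """Indices of the n largest-magnitude nonzero weights."""
    ranked = sorted(range(len(w)), key=lambda j: abs(w[j]), reverse=True)
    return [j for j in ranked[:n] if w[j] != 0.0]


def make_synthetic_diffs(n_pairs, seed):
    """Toy data (assumption): rubrics 0 and 1 track the preference, rubrics 2-5 are noise."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n_pairs):
        row = []
        for k in range(6):
            if k < 2:
                u = rng.random()
                row.append(1 if u < 0.8 else (-1 if u < 0.9 else 0))
            else:
                row.append(rng.choice([-1, 0, 1]))
        rows.append(row)
    return rows


diffs = make_synthetic_diffs(200, seed=0)
weights = fit_l1_logistic(diffs, lam=0.1)
selected = top_n(weights, 2)
print(selected, [round(x, 3) for x in weights])
```

On this toy data the two informative rubrics should end up with the largest weights, while the shrinkage step pushes the noise rubrics toward zero. A library solver with an L1 penalty would be the more usual choice in practice; the hand-written loop is only there to keep the sketch dependency-free.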

If this is right

  • Reward models for text-to-image alignment can be built from orders-of-magnitude less human preference data than current practice.
  • Evaluation criteria become explicit and human-readable, allowing inspection and editing of the rules that drive scoring.
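  • The learned rubrics are plain-text rules rather than network weights, so the same set can in principle be handed to a different vision-language model judge without retraining a reward network; the paper's optimized sets are reported per judge (Figures 12 to 15), so any such reuse should be checked.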
  • Reinforcement learning loops for diffusion models achieve higher generation quality when guided by these rubric-based rewards instead of scalar models.
  • The data-efficiency gain opens the possibility of rapidly adapting alignment signals to new domains or user populations.
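To make the RL bullet above concrete, here is a minimal sketch of how a weighted rubric set could be turned into a training signal. It assumes binary Yes/No verdicts per rubric (as in the judge template of Figure 9) and a GRPO-style normalisation within each prompt's sample group; the function names, weights, and verdicts are hypothetical, and the page does not describe how Flow-GRPO actually consumes the reward.

```python
from statistics import mean, pstdev


def rubric_reward(answers, weights):
    """Weighted sum of binary rubric verdicts (1 = 'yes', 0 = 'no')."""
    return sum(w * a for w, a in zip(weights, answers))


def group_advantages(rewards, eps=1e-6):
    """Centre and scale rewards within one prompt's group of sampled images."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical weights for three selected rubrics and verdicts for three sampled images.
weights = [1.5, 0.5, 1.0]
verdicts = [[1, 1, 1], [1, 0, 0], [0, 0, 0]]
rewards = [rubric_reward(v, weights) for v in verdicts]
advantages = group_advantages(rewards)
print(rewards, advantages)
```

Each term of the reward maps back to one readable rule, so a surprising reward can be traced to the specific rubric verdicts behind it, which is the inspection property the second bullet refers to.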

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The rubric-learning approach could be tested for transfer to other generative modalities such as video or audio where preference data is also scarce.
  • Examining the final selected rubrics might surface which visual attributes humans weigh most heavily when judging prompt alignment.
  • One could measure whether the same rubrics remain effective when the underlying VLM is replaced by a newer or differently trained model.
  • The method suggests a route to hybrid systems that combine the speed of trained scalar rewards with the transparency of rule-based judges.

Load-bearing premise

The synthesized reasoning traces from preference pairs can be turned into rubrics whose VLM-derived scores reliably reflect the original human preferences, and the L1-regularized logistic regression step selects a set of rubrics that generalize beyond the data used to create them.

What would settle it

Running the full pipeline on a fresh human preference dataset and finding that the selected rubrics produce VLM scores whose agreement with human judgments falls to chance level or below the agreement achieved by a standard scalar reward model trained on the same data.
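The check described above comes down to counting how often the rubric reward ranks the human-preferred image higher, then asking whether that count is distinguishable from coin flipping. The sketch below is illustrative only: the ten score pairs are invented, ties are counted as misses by assumption, and the same counting would be repeated for the scalar reward model to make the comparison.

```python
from math import comb


def count_agreements(score_pairs):
    """Count pairs where the human-preferred image receives the strictly higher score.

    score_pairs holds (score_of_preferred, score_of_rejected); ties count as misses.
    """
    hits = sum(1 for preferred, rejected in score_pairs if preferred > rejected)
    return hits, len(score_pairs)


def chance_tail(hits, n):
    """P(X >= hits) for X ~ Binomial(n, 0.5): how easily coin flips would do this well."""
    return sum(comb(n, i) for i in range(hits, n + 1)) / 2 ** n


# Hypothetical rubric rewards on ten fresh human-labelled pairs.
pairs = [(3.0, 1.5), (2.0, 2.5), (1.0, 0.0), (2.5, 1.0), (0.5, 0.5),
         (3.0, 0.0), (1.5, 1.0), (2.0, 0.5), (1.0, 0.5), (2.5, 2.0)]
hits, n = count_agreements(pairs)
print(hits / n, chance_tail(hits, n))
```

With only ten pairs, eight agreements still leave a tail probability of roughly 5 percent under pure chance, which illustrates why a fresh evaluation set needs to be reasonably large before the claim can be called settled either way.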

Figures

Figures reproduced from arXiv: 2605.17602 by Cho-Jui Hsieh, Daixuan Huo, Kuei-Chun Kao, Yuanhao Ban.

Figure 1: Reward hacking in scalar reward optimization. HPSv3 optimization attains a high scalar reward while violating prompt-specific constraints, whereas AutoRubric-T2I favors the rubric-aligned generation.
Figure 2: Overview of AutoRubric-T2I. Our framework first constructs a seed rubric pool through diversity-aware seed selection and rubric generation. It then iteratively scores training pairs, selects discriminative rubrics with sparse logistic regression, mines hard pairs, and proposes new rubrics to refine the final weighted rubric set.
Figure 3: Training dynamics of scalar and rubric-based T2I rewards.
Figure 4: The evolution of generation quality of RL using AutoRubrics and other scalar reward models. The visual quality of scalar reward models degrades notably while the reward increases.
Figure 5: Qualitative comparison of downstream T2I RL policies. AutoRubric-T2I better preserves prompt-specific objects, relations, and fine-grained details compared with the base model, scalar reward optimization, and AutoRule-based rubric rewards.
Figure 6: Qualitative examples from downstream RL fine-tuning with AutoRubric-T2I rewards. Each row shows a text prompt and the corresponding generated image, demonstrating improved prompt alignment, object placement, attribute accuracy, and overall visual quality after RL training.
Figure 7: Seed rubric generation, stage 1: vision reasoner that produces a step-by-step preference rationale for each image pair.
Figure 8: Seed rubric generation, stages 2-3: rule extractor and rule merger.
Figure 9: VLM judge templates: Yes/No binary scoring.
Figure 10: Hard-pair refinement prompt.
Figure 11: Screenshot of the human evaluation survey interface. Annotators were asked to choose the …
Figure 12: Optimized rubric set for Qwen-3-VL-8B trained on HPSv3 preference pairs (round 3).
Figure 13: Optimized rubric set for Qwen-3-VL-8B trained on PickScore preference pairs (round 6).
Figure 14: Optimized rubric set for Qwen-3-VL-32B trained on HPSv3 preference pairs (round 3).
Figure 15: Optimized rubric set for Qwen-3-VL-32B trained on PickScore preference pairs (round 6).
Original abstract

Aligning Text-to-Image (T2I) generation models with human preferences increasingly relies on image reward models that score or rank generated images according to prompt alignment and perceptual quality. Existing reward models are commonly trained as Bradley-Terry (BT) preference models on large-scale human preference corpora, making them costly to train, difficult to adapt, and opaque in their evaluation criteria. Meanwhile, Vision-Language Model (VLM) judges can provide more fine-grained assessments through textual rubrics, but their manually designed or heuristically generated scoring rules may fail to reliably reflect human preferences. In this paper, we propose AutoRubric-T2I, the first rubric learning framework in T2I that automatically synthesizes and selects explicit rubrics for guiding VLM judges. AutoRubric-T2I first synthesizes reasoning traces from preference pairs into candidate rubrics, then uses a VLM judge to score paired images under each rubric, producing pairwise rubric-score differences for preference learning. To remove noisy and redundant rules, we further employ a $\ell_1$-Regularized Logistic Regression Refiner, which selects the Top-$N$ most discriminative rubrics. Extensive evaluations show that AutoRubric-T2I produces high-quality, interpretable reward signals using less than 0.01% of the annotated preference data, substantially reducing the need for large-scale reward-model training. On image reward benchmarks such as MMRB2, AutoRubric-T2I outperforms strong reward model baselines. We further validate AutoRubric-T2I as an RL reward on downstream T2I tasks, including TIIF and UniGenBench++, where it improves generation quality over scalar reward models using the Flow-GRPO pipeline on diffusion models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper presents AutoRubric-T2I, a framework for automatically learning explicit rubrics to guide Vision-Language Model (VLM) judges in evaluating text-to-image (T2I) generations. It synthesizes reasoning traces from a small subset (<0.01%) of human preference pairs into candidate rubrics, scores paired images using a VLM under each rubric, and applies an ℓ1-regularized logistic regression to select the top-N most discriminative rubrics. The approach claims to produce high-quality, interpretable reward signals that outperform strong baselines on benchmarks like MMRB2 while improving downstream T2I generation quality in RL fine-tuning, all with drastically reduced data requirements.

Significance. If validated, this work offers a significant advancement in making reward models for T2I alignment more data-efficient, interpretable, and adaptable compared to traditional Bradley-Terry models trained on large corpora. By leveraging VLM judges with learned rubrics, it addresses opacity in existing reward models and reduces the cost of large-scale preference data collection. The potential for rule-based, human-aligned evaluation could influence future work in multimodal alignment and RL for generative models.

major comments (2)
  1. [§3.2–3.3] (rubric scoring and L1 refiner): The logistic regression is fit directly to pairwise score differences computed from the identical small preference subset used to synthesize the candidate rubrics in §3.1. This creates a dependence that risks selecting rubrics exploiting VLM-specific artifacts or in-sample correlations rather than generalizable rules; no held-out rubric validation or causal test is described to rule out overfitting.
  2. [§4] (experiments): Claims of outperforming baselines on MMRB2 and improving downstream RL tasks (TIIF, UniGenBench++) with <0.01% data are stated without reported numerical values, error bars, ablation on Top-N or regularization strength, or controls for systematic bias in the VLM judge itself. These omissions make it impossible to verify that the reported gains are robust rather than artifacts of the synthesis set.
minor comments (2)
  1. [§3.3] The free parameters (Top-N, L1 strength) are mentioned but their selection procedure or sensitivity analysis is not detailed; add a short paragraph or table showing how they were chosen.
  2. [§3.2] Notation for rubric-score differences and the logistic target should be formalized with an equation to improve clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. We address each major comment below and have revised the manuscript to strengthen the presentation of our method and results.

Point-by-point responses
  1. Referee: [§3.2–3.3] (rubric scoring and L1 refiner): The logistic regression is fit directly to pairwise score differences computed from the identical small preference subset used to synthesize the candidate rubrics in §3.1. This creates a dependence that risks selecting rubrics exploiting VLM-specific artifacts or in-sample correlations rather than generalizable rules; no held-out rubric validation or causal test is described to rule out overfitting.

    Authors: We agree that fitting the L1-regularized logistic regression on the same small subset used for rubric synthesis introduces a risk of selecting rules that capture in-sample correlations or VLM-specific patterns. The synthesis step generates candidate rubrics from reasoning traces, but the subsequent scoring and selection occur on the identical pairs. To address this, we have added a held-out validation procedure in the revised §3.3: after rubric selection on the synthesis subset, we evaluate the selected rubrics on a disjoint held-out portion of the preference data and report the resulting preference prediction accuracy. We also include a brief causal-style check by measuring rubric stability across different random splits of the synthesis set. These additions are now described in the revised manuscript. revision: yes

  2. Referee: [§4] (experiments): Claims of outperforming baselines on MMRB2 and improving downstream RL tasks (TIIF, UniGenBench++) with <0.01% data are stated without reported numerical values, error bars, ablation on Top-N or regularization strength, or controls for systematic bias in the VLM judge itself. These omissions make it impossible to verify that the reported gains are robust rather than artifacts of the synthesis set.

    Authors: The original submission emphasized relative improvements and data efficiency but did not include the full set of quantitative results, error bars, or ablations requested. We have revised §4 to include: (i) exact numerical scores and standard deviations on MMRB2 for AutoRubric-T2I versus the strongest baselines, (ii) ablation tables varying Top-N and the L1 regularization coefficient, (iii) error bars across three random seeds for both reward-model and downstream RL experiments, and (iv) a control experiment that replaces the learned rubrics with a fixed generic VLM prompt to isolate the contribution of the selected rubrics from any systematic VLM bias. These results are now reported with the corresponding tables and figures in the revised manuscript. revision: yes
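The held-out check and the split-stability check mentioned in the first response can be sketched as follows. Everything here is a hypothetical illustration: the rebuttal gives no procedure details, `mean_gap_selector` is a deliberately crude stand-in for the L1 refiner, the unweighted held-out vote ignores rubric weights, and the data are constructed so that the outcome is obvious.

```python
import random
from itertools import combinations


def jaccard(a, b):
    """Overlap of two selected-rubric sets: |A and B| / |A or B|."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a | b) else 1.0


def split_pairs(pairs, holdout_frac, rng):
    """Shuffle a copy of the pairs and cut it into (fit, held-out) parts."""
    shuffled = pairs[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - holdout_frac))
    return shuffled[:cut], shuffled[cut:]


def heldout_accuracy(heldout, selected):
    """A pair counts as correct when the selected rubrics' summed score gap is positive."""
    hits = sum(1 for d in heldout if sum(d[k] for k in selected) > 0)
    return hits / len(heldout)


def stability(pairs, select_fn, n_splits=5, holdout_frac=0.3, seed=0):
    """Mean pairwise Jaccard overlap of selections and mean held-out accuracy across splits."""
    rng = random.Random(seed)
    chosen_sets, accuracies = [], []
    for _ in range(n_splits):
        fit, held = split_pairs(pairs, holdout_frac, rng)
        chosen = select_fn(fit)
        chosen_sets.append(chosen)
        accuracies.append(heldout_accuracy(held, chosen))
    overlaps = [jaccard(a, b) for a, b in combinations(chosen_sets, 2)]
    return sum(overlaps) / len(overlaps), sum(accuracies) / len(accuracies)


def mean_gap_selector(fit, threshold=0.3):
    """Stand-in selector (assumption): keep rubrics whose mean score gap exceeds threshold."""
    n_rubrics = len(fit[0])
    return [k for k in range(n_rubrics)
            if sum(d[k] for d in fit) / len(fit) > threshold]


# Degenerate toy data: rubric 0 always agrees, rubric 1 always disagrees, rubric 2 is silent.
toy_pairs = [[1, -1, 0] for _ in range(20)]
overlap, accuracy = stability(toy_pairs, mean_gap_selector)
print(overlap, accuracy)
```

A mean overlap well below 1 across splits would suggest the selection is driven by in-sample noise, which is the failure mode the referee's first comment is concerned with.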

Circularity Check

0 steps flagged

No significant circularity; empirical pipeline with external validation

Full rationale

The paper presents an empirical framework that synthesizes candidate rubrics from a small subset of preference pairs, scores them via an external VLM, and applies L1-regularized logistic regression for selection. This process is a standard data-driven feature selection step within a proposed method, not a first-principles derivation or prediction that reduces to its inputs by construction. Performance is assessed on separate benchmarks (MMRB2) and downstream RL tasks, with no load-bearing self-citations, uniqueness theorems, or self-definitional equations identified in the described chain. The approach remains self-contained against external evaluation.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

Information is limited to the abstract; the method relies on a small number of hyperparameters for the refiner and on the domain assumption that VLM rubric scoring tracks human preference.

free parameters (2)
  • Top-N
    Number of rubrics retained after L1-regularized logistic regression; chosen to balance discriminativeness and noise.
  • L1 regularization strength
    Controls sparsity in rubric selection; fitted or tuned on the preference-derived score differences.
axioms (1)
  • domain assumption: Reasoning traces extracted from preference pairs can be converted into explicit rubrics that, when scored by a VLM, produce differences correlated with the original human judgments.
    This premise is required for the synthesis step to produce useful candidate rubrics.

pith-pipeline@v0.9.0 · 5863 in / 1607 out tokens · 42905 ms · 2026-05-22T09:16:39.416748+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Tag meanings

  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

39 extracted references · 39 canonical work pages · 9 internal anchors

  1. [1]

    Training Diffusion Models with Reinforcement Learning

    Kevin Black, Michael Janner, Yilun Du, Ilya Kostrikov, and Sergey Levine. Training diffusion models with reinforcement learning.arXiv preprint arXiv:2305.13301, 2023

  2. [2]

    Odin: Disentangled reward mitigates hacking in rlhf.arXiv preprint arXiv:2402.07319, 2024

    Lichang Chen, Chen Zhu, Davit Soselia, Jiuhai Chen, Tianyi Zhou, Tom Goldstein, Heng Huang, Mehrdad Farajtabar, and Hongyang Li. Odin: Disentangled reward mitigates hacking in rlhf.arXiv preprint arXiv:2402.07319, 2024

  3. [3]

    Scaling rectified flow transformers for high-resolution image synthesis

    Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis. In Forty-first international conference on machine learning, 2024

  4. [4]

    Rubricrl: Simple generalizable rewards for text-to-image generation.arXiv preprint arXiv:2511.20651, 2025

    Xuelu Feng, Yunsheng Li, Ziyu Wan, Zixuan Gao, Junsong Yuan, Dongdong Chen, and Chunming Qiao. Rubricrl: Simple generalizable rewards for text-to-image generation.arXiv preprint arXiv:2511.20651, 2025

  5. [5]

    Gemini 3 system card

    Google DeepMind. Gemini 3 system card. https://deepmind.google/technologies/gemini/, 2025. Accessed: 2026-04-23

  6. [6]

    Rubrics as Rewards: Reinforcement Learning Beyond Verifiable Domains

    Anisha Gunjal, Anthony Wang, Elaine Lau, Vaskar Nath, Yunzhong He, Bing Liu, and Sean Hendryx. Rubrics as rewards: Reinforcement learning beyond verifiable domains.arXiv preprint arXiv:2507.17746, 2025

  7. [7]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Runxin Zhang, Runze Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025

  8. [8]

    Understanding reward hacking in text-to-image reinforcement learning.arXiv preprint arXiv:2601.03468, 2026

    Yunqi Hong, Kuei-Chun Kao, Hengguang Zhou, and Cho-Jui Hsieh. Understanding reward hacking in text-to-image reinforcement learning.arXiv preprint arXiv:2601.03468, 2026

  9. [9]

    Yushi Hu, Benlin Liu, Jungo Kasai, Yizhong Wang, Mari Ostendorf, Ranjay Krishna, and Noah A. Smith. Tifa: Accurate and interpretable text-to-image faithfulness evaluation with question answering.arXiv preprint arXiv:2303.11897, 2023

  10. [10]

    Multimodal rewardbench 2: Evaluating omni reward models for interleaved text and image.arXiv preprint arXiv:2512.16899, 2025

    Yushi Hu, Reyhane Askari-Hemmat, Melissa Hall, Emily Dinan, Luke Zettlemoyer, and Marjan Ghazvininejad. Multimodal rewardbench 2: Evaluating omni reward models for interleaved text and image.arXiv preprint arXiv:2512.16899, 2025

  11. [11]

    Reinforcement learning with rubric anchors

    Zenan Huang, Yihong Zhuang, Guoshan Lu, Zeyu Qin, Haokai Xu, Tianyu Zhao, Ru Peng, Jiaqi Hu, Zhanming Shen, Xiaomeng Hu, et al. Reinforcement learning with rubric anchors. arXiv preprint arXiv:2508.12790, 2025

  12. [12]

    Orthogonal matching pursuit with replacement

    Prateek Jain, Ambuj Tewari, and Inderjit Dhillon. Orthogonal matching pursuit with replacement. Advances in neural information processing systems, 24, 2011

  13. [13]

    Genai arena: An open evaluation platform for generative models.Advances in Neural Information Processing Systems, 37:79889–79908, 2024

    Dongfu Jiang, Max Ku, Tianle Li, Yuansheng Ni, Shizhuo Sun, Rongqi Fan, and Wenhu Chen. Genai arena: An open evaluation platform for generative models.Advances in Neural Information Processing Systems, 37:79889–79908, 2024

  14. [14]

    T2i-r1: Reinforcing image generation with collaborative semantic-level and token-level cot.arXiv preprint arXiv:2505.00703, 2025

    Dongzhi Jiang, Ziyu Guo, Renrui Zhang, Zhuofan Zong, Hao Li, Le Zhuo, Shilin Yan, Pheng-Ann Heng, and Hongsheng Li. T2i-r1: Reinforcing image generation with collaborative semantic-level and token-level cot. arXiv preprint arXiv:2505.00703, 2025

  15. [15]

    Pick-a-pic: An open dataset of user preferences for text-to-image generation.Advances in neural information processing systems, 36:36652–36663, 2023

    Yuval Kirstain, Adam Polyak, Uriel Singer, Shahbuland Matiana, Joe Penna, and Omer Levy. Pick-a-pic: An open dataset of user preferences for text-to-image generation.Advances in neural information processing systems, 36:36652–36663, 2023

  16. [16]

    Genai-bench: Evaluating and improving compositional text-to-visual generation.arXiv preprint arXiv:2406.13743, 2024

    Baiqi Li, Zhiqiu Lin, Deepak Pathak, Jiayao Li, Yixin Fei, Kewen Wu, Tiffany Ling, Xide Xia, Pengchuan Zhang, Graham Neubig, and Deva Ramanan. Genai-bench: Evaluating and improving compositional text-to-visual generation.arXiv preprint arXiv:2406.13743, 2024

  17. [17]

    Rubrichub: A comprehensive and highly discriminative rubric dataset via automated coarse-to-fine generation.arXiv preprint arXiv:2601.08430, 2026

    Sunzhu Li, Jiale Zhao, Miteto Wei, Huimin Ren, Yang Zhou, Jingwen Yang, Shunyu Liu, Kaike Zhang, and Wei Chen. Rubrichub: A comprehensive and highly discriminative rubric dataset via automated coarse-to-fine generation.arXiv preprint arXiv:2601.08430, 2026

  18. [18]

    Flow-GRPO: Training Flow Matching Models via Online RL

    Jie Liu, Gongye Liu, Jiajun Liang, Yangguang Li, Jiaheng Liu, Xintao Wang, Pengfei Wan, Di Zhang, and Wanli Ouyang. Flow-grpo: Training flow matching models via online rl.arXiv preprint arXiv:2505.05470, 2025

  19. [19]

    Openrubrics: Contrastive rubric generation for reward models.arXiv preprint arXiv:2505.14826, 2025

    Zhen Liu, Yixin Wang, Jianfei Chen, and Jun Zhu. Openrubrics: Contrastive rubric generation for reward models.arXiv preprint arXiv:2505.14826, 2025

  20. [20]

    Hpsv3: Towards wide-spectrum human preference score

    Yuhang Ma, Xiaoshi Wu, Keqiang Sun, and Hongsheng Li. Hpsv3: Towards wide-spectrum human preference score. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 15086–15095, 2025

  21. [21]

    Orthogonal matching pursuit: Recursive function approximation with applications to wavelet decomposition

    Yagyensh Chandra Pati, Ramin Rezaiifar, and Perinkulam Sambamurthy Krishnaprasad. Orthogonal matching pursuit: Recursive function approximation with applications to wavelet decomposition. In Proceedings of 27th Asilomar conference on signals, systems and computers, pages 40–44. IEEE, 1993

  22. [22]

    Onlinerubrics: Dynamic rubric elicitation for online reinforcement learning.arXiv preprint arXiv:2507.09832, 2025

    Keivan Rezaei, Xuechen He, and Percy Liang. Onlinerubrics: Dynamic rubric elicitation for online reinforcement learning.arXiv preprint arXiv:2507.09832, 2025

  23. [23]

    Rrd: Recursive rubric decomposition for scalable reward modeling.arXiv preprint arXiv:2601.05743, 2026

    Yifan Shen, Xiang Li, Wei Zhang, and Yang Liu. Rrd: Recursive rubric decomposition for scalable reward modeling.arXiv preprint arXiv:2601.05743, 2026

  24. [24]

    Autorule: Reasoning chain-of-thought extracted rule-based rewards improve preference learning.arXiv preprint arXiv:2506.15651, 2025

    Tevin Wang and Chenyan Xiong. Autorule: Reasoning chain-of-thought extracted rule-based rewards improve preference learning.arXiv preprint arXiv:2506.15651, 2025

  25. [25]

    Unigenbench++: A unified semantic evaluation benchmark for text-to-image generation.arXiv preprint arXiv:2510.18701, 2025

    Yibin Wang, Zhimin Li, Yuhang Zang, Jiazi Bu, Yujie Zhou, Yi Xin, Junjun He, Chunyu Wang, Qinglin Lu, Cheng Jin, et al. Unigenbench++: A unified semantic evaluation benchmark for text-to-image generation.arXiv preprint arXiv:2510.18701, 2025

  26. [26]

    Unified Reward Model for Multimodal Understanding and Generation

    Yibin Wang, Yuhang Zang, Hao Li, Cheng Jin, and Jiaqi Wang. Unified reward model for multimodal understanding and generation.arXiv preprint arXiv:2503.05236, 2025

  27. [27]

    Tiif-bench: How does your t2i model follow your instructions?arXiv preprint arXiv:2506.02161, 2025

    Xinyu Wei, Jinrui Zhang, Zeqing Wang, Hongyang Wei, Zhen Guo, and Lei Zhang. Tiif-bench: How does your t2i model follow your instructions?arXiv preprint arXiv:2506.02161, 2025

  28. [28]

    Reinforcement Learning with Verifiable Rewards Implicitly Incentivizes Correct Reasoning in Base LLMs

    Xumeng Wen, Zihan Liu, Shun Zheng, Shengyu Ye, Zhirong Wu, Yang Wang, Zhijian Xu, Xiao Liang, Junjie Li, Ziming Miao, et al. Reinforcement learning with verifiable rewards implicitly incentivizes correct reasoning in base llms.arXiv preprint arXiv:2506.14245, 2025

  29. [29]

    Human Preference Score v2: A Solid Benchmark for Evaluating Human Preferences of Text-to-Image Synthesis

    Xiaoshi Wu, Yiming Hao, Keqiang Sun, Yixiong Chen, Feng Zhu, Rui Zhao, and Hongsheng Li. Human preference score v2: A solid benchmark for evaluating human preferences of text-to-image synthesis.arXiv preprint arXiv:2306.09341, 2023

  30. [30]

    Rewarddance: Scaling visual reward modeling via generative next-token prediction.arXiv preprint arXiv:2504.12345, 2025

    Xiaoshi Wu, Yiming Li, Keqiang Zhang, and Hongsheng Li. Rewarddance: Scaling visual reward modeling via generative next-token prediction. arXiv preprint arXiv:2504.12345, 2025

  31. [31]

    Auto-rubric: Learning from implicit weights to explicit rubrics for reward modeling.arXiv preprint arXiv:2510.17314, 2025

    Lipeng Xie, Sen Huang, Zhuo Zhang, Anni Zou, Yunpeng Zhai, Dingchao Ren, Kezun Zhang, Haoyuan Hu, Boyin Liu, Haoran Chen, et al. Auto-rubric: Learning from implicit weights to explicit rubrics for reward modeling.arXiv preprint arXiv:2510.17314, 2025

  32. [32]

    Imagereward: Learning and evaluating human preferences for text-to-image generation

    Jiazheng Xu, Xiao Liu, Yuchen Wu, Yuxuan Tong, Qinkai Li, Ming Ding, Jie Tang, and Yuxiao Dong. Imagereward: Learning and evaluating human preferences for text-to-image generation. Advances in Neural Information Processing Systems, 36:15903–15935, 2023

  33. [33]

    Alternating reinforcement learning for rubric-based reward modeling in non-verifiable llm post-training.arXiv preprint arXiv:2602.01511, 2026

    Ran Xu, Tianci Liu, Zihan Dong, Tony Yu, Ilgee Hong, Carl Yang, Linjun Zhang, Tao Zhao, and Haoyu Wang. Alternating reinforcement learning for rubric-based reward modeling in non-verifiable llm post-training.arXiv preprint arXiv:2602.01511, 2026

  34. [34]

    DanceGRPO: Unleashing GRPO on Visual Generation

    Zeyue Xue, Jie Wu, Yu Gao, Fangyuan Kong, Lingting Zhu, Mengzhao Chen, Zhiheng Liu, Wei Liu, Qiushan Guo, Weilin Huang, et al. Dancegrpo: Unleashing grpo on visual generation. arXiv preprint arXiv:2505.07818, 2025

  35. [35]

    Qwen3 Technical Report

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025

  36. [36]

    Automated filtering of human feedback data for aligning text-to-image diffusion models.arXiv preprint arXiv:2410.10166, 2024

    Yongjin Yang, Sihyeon Kim, Hojung Jung, Sangmin Bae, SangMook Kim, Se-Young Yun, and Kimin Lee. Automated filtering of human feedback data for aligning text-to-image diffusion models.arXiv preprint arXiv:2410.10166, 2024

  37. [37]

    Sparse random feature algorithm as coordinate descent in hilbert space.Advances in Neural Information Processing Systems, 27, 2014

    Ian E Yen, Ting-Wei Lin, Shou-De Lin, Pradeep Ravikumar, and Inderjit S Dhillon. Sparse random feature algorithm as coordinate descent in hilbert space.Advances in Neural Information Processing Systems, 27, 2014

  38. [38]

    Does this image satisfy this rule?

    Junkai Zhang, Zihao Wang, Lin Gui, Swarnashree Mysore Sathyendra, Jaehwan Jeong, Victor Veitch, Wei Wang, Yunzhong He, Bing Liu, and Lifeng Jin. Chasing the tail: Effective rubric-based reward modeling for large language model post-training. arXiv preprint arXiv:2509.21500, 2026

  39. [39]

    thinking-with-images

    [10] is a recent omni reward-model benchmark covering four subtasks: text-to-image, image editing, interleaved generation, and multimodal reasoning (“thinking-with-images”), with 1,000 expert-annotated preference pairs per subtask drawn from 23 frontier models across 21 source tasks. Generative T2I Benchmarks. For T2I generative quality assessment on RL post-tra...