AutoRubric-T2I: Robust Rule-Based Reward Model for Text-to-Image Alignment

Cho-Jui Hsieh; Daixuan Huo; Kuei-Chun Kao; Yuanhao Ban

arXiv: 2605.17602 · v2 · pith:CHJMJXHWnew · submitted 2026-05-17 · 💻 cs.AI · cs.CV · cs.LG

Pith reviewed 2026-05-22 09:16 UTC · model grok-4.3

classification 💻 cs.AI · cs.CV · cs.LG

keywords text-to-image generation · reward model · vision-language model · rubric learning · preference alignment · interpretable evaluation · reinforcement learning · data-efficient training

The pith

AutoRubric-T2I turns a very small amount of preference data into a compact set of explicit rubrics that let VLMs judge text-to-image alignment more accurately than trained reward models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents AutoRubric-T2I as a framework that automatically creates and selects clear scoring rules for vision-language models to evaluate generated images against their prompts. It starts by turning human preference pairs into candidate rubrics through synthesized reasoning traces, then has the VLM assign scores to image pairs under each rule. An L1-regularized logistic regression step then keeps only the most useful rubrics, removing noise and redundancy. This approach produces reliable reward signals while using less than 0.01 percent of the usual annotated preference data, and it outperforms standard reward-model baselines on benchmarks like MMRB2. The same rubrics also serve as rewards in reinforcement learning pipelines to improve final image quality on downstream tasks.
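
As a rough illustration of the scoring step described above, the sketch below builds pairwise rubric-score differences from binary judgments. The `Judge` callable, the `toy_judge` lookup table, and the rubric strings are hypothetical stand-ins: a real pipeline would query a VLM at that point, and the paper's actual prompts and score scale are not reproduced here.

```python
from typing import Callable, List, Tuple

# Hypothetical judge signature: (prompt, image_id, rubric) -> 1 for "Yes", 0 for "No".
Judge = Callable[[str, str, str], int]


def rubric_differences(
    pairs: List[Tuple[str, str, str]],  # (prompt, image_a, image_b)
    rubrics: List[str],
    judge: Judge,
) -> List[List[int]]:
    """One row per pair; entry j is score(image_a) - score(image_b) under rubric j."""
    rows = []
    for prompt, img_a, img_b in pairs:
        rows.append(
            [judge(prompt, img_a, r) - judge(prompt, img_b, r) for r in rubrics]
        )
    return rows


# Toy stand-in for a VLM judge: a table of which image satisfies which rubric.
_SATISFIED = {
    ("img_1", "object count matches the prompt"),
    ("img_1", "requested colors are correct"),
    ("img_2", "requested colors are correct"),
}


def toy_judge(prompt: str, image_id: str, rubric: str) -> int:
    return int((image_id, rubric) in _SATISFIED)


rubrics = ["object count matches the prompt", "requested colors are correct"]
pairs = [("three red apples", "img_1", "img_2")]
deltas = rubric_differences(pairs, rubrics, toy_judge)
print(deltas)  # [[1, 0]]
```

A difference of 0 means the rubric does not separate the two images, so only the first rubric carries signal for this pair.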

Core claim

AutoRubric-T2I first synthesizes reasoning traces from preference pairs into candidate rubrics, scores paired images under each rubric with a VLM to obtain pairwise differences, and applies an L1-regularized logistic regression refiner to select the top-N most discriminative rubrics. The resulting compact, interpretable rule set yields high-quality reward signals from under 0.01 percent of typical annotated data, outperforms strong reward-model baselines on MMRB2, and improves generation quality when used as an RL reward in pipelines such as Flow-GRPO on diffusion models for tasks including TIIF and UniGenBench++.
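
The selection step can be sketched as sparse logistic regression over the rubric-score differences. This is a minimal pure-Python illustration, not the authors' implementation: it uses proximal gradient descent (ISTA) rather than the paper's solver, and the toy data, `lam`, `lr`, and `steps` values are assumptions chosen for the example. Labels are +1 when the first image of a pair is the human-preferred one and -1 otherwise.

```python
import math
from typing import List


def soft_threshold(x: float, t: float) -> float:
    """Proximal operator of t * |x|: shrink toward zero, clip small values to 0."""
    return math.copysign(max(abs(x) - t, 0.0), x)


def fit_l1_logistic(deltas: List[List[float]], labels: List[int],
                    lam: float = 0.05, lr: float = 0.5, steps: int = 500) -> List[float]:
    """Minimize mean log(1 + exp(-z * w.d)) + lam * ||w||_1 with ISTA."""
    n, k = len(deltas), len(deltas[0])
    w = [0.0] * k
    for _ in range(steps):
        grad = [0.0] * k
        for d, z in zip(deltas, labels):
            margin = z * sum(wj * dj for wj, dj in zip(w, d))
            coeff = -z / (1.0 + math.exp(margin))  # derivative of the logistic loss
            for j in range(k):
                grad[j] += coeff * d[j] / n
        w = [soft_threshold(wj - lr * gj, lr * lam) for wj, gj in zip(w, grad)]
    return w


def top_n(w: List[float], n: int) -> List[int]:
    """Indices of the n largest-magnitude weights, skipping exact zeros."""
    ranked = sorted(range(len(w)), key=lambda j: abs(w[j]), reverse=True)
    return [j for j in ranked[:n] if w[j] != 0.0]


# Toy data: rubric 0 tracks the label, rubric 1 is uncorrelated, rubric 2 never fires.
deltas = [[1, 1, 0], [1, -1, 0], [-1, 1, 0], [-1, -1, 0], [1, 0, 0], [-1, 0, 0]]
labels = [1, 1, -1, -1, 1, -1]

w = fit_l1_logistic(deltas, labels)
print(top_n(w, 2))  # [0]
```

On this toy input the penalty leaves only the informative rubric with a non-zero weight, which is the behavior the refiner relies on to discard noisy and redundant rules.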

What carries the argument

The AutoRubric-T2I pipeline that converts preference pairs into candidate rubrics via synthesized reasoning traces and refines them with L1-regularized logistic regression to produce a small set of explicit, discriminative rules for VLM-based scoring.

If this is right

Reward models for text-to-image alignment can be built from orders-of-magnitude less human preference data than current practice.
Evaluation criteria become explicit and human-readable, allowing inspection and editing of the rules that drive scoring.
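Because the learned rubrics are plain text, the same set could in principle be handed to a different vision-language model without retraining a reward network, though the rubric sets shown in Figures 12–15 are each optimized for a specific judge.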
Reinforcement learning loops for diffusion models achieve higher generation quality when guided by these rubric-based rewards instead of scalar reward models.
The data-efficiency gain opens the possibility of rapidly adapting alignment signals to new domains or user populations.
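
To make the RL use case concrete, here is a hedged sketch of how a weighted rubric set might be turned into a scalar reward and then into group-relative advantages. The weights, rubric strings, and judgments are made up, and the standardization within a sample group follows the general GRPO idea; the exact normalization used by Flow-GRPO is not taken from the source.

```python
from statistics import mean, pstdev
from typing import Dict, List


def rubric_reward(judgments: Dict[str, int], weights: Dict[str, float]) -> float:
    """Weighted sum of binary rubric judgments for one generated image."""
    return sum(weights[r] * judgments.get(r, 0) for r in weights)


def group_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Standardize rewards within the group of samples drawn for one prompt."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical learned weights for two rubrics.
weights = {"object count matches the prompt": 2.0, "requested colors are correct": 1.0}

# Hypothetical judge outputs for three samples generated from the same prompt.
group = [
    {"object count matches the prompt": 1, "requested colors are correct": 1},
    {"object count matches the prompt": 0, "requested colors are correct": 1},
    {"object count matches the prompt": 0, "requested colors are correct": 0},
]

rewards = [rubric_reward(j, weights) for j in group]
advantages = group_advantages(rewards)
print(rewards)  # [3.0, 1.0, 0.0]
```

Because each term of the reward maps to a named rubric, a low reward can be traced back to the specific rule a sample failed, which a single scalar model does not expose.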

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The rubric-learning approach could be tested for transfer to other generative modalities such as video or audio where preference data is also scarce.
Examining the final selected rubrics might surface which visual attributes humans weigh most heavily when judging prompt alignment.
One could measure whether the same rubrics remain effective when the underlying VLM is replaced by a newer or differently trained model.
The method suggests a route to hybrid systems that combine the speed of trained scalar rewards with the transparency of rule-based judges.

Load-bearing premise

The synthesized reasoning traces from preference pairs can be turned into rubrics whose VLM-derived scores reliably reflect the original human preferences, and the L1-regularized logistic regression step selects a set of rubrics that generalize beyond the data used to create them.

What would settle it

Running the full pipeline on a fresh human preference dataset and finding that the selected rubrics produce VLM scores whose agreement with human judgments falls to chance level or below the agreement achieved by a standard scalar reward model trained on the same data.
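
One way to operationalize this test is to compute pairwise agreement with human labels and check whether its confidence interval excludes chance (0.5). The sketch below uses a Wilson score interval; the labels are synthetic and the 95% level (`z = 1.96`) is an assumption, not something the source specifies.

```python
import math
from typing import List, Tuple


def agreement(pred: List[int], human: List[int]) -> float:
    """Fraction of pairs where the predicted preference matches the human one."""
    return sum(p == h for p, h in zip(pred, human)) / len(human)


def wilson_interval(acc: float, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion."""
    denom = 1.0 + z * z / n
    centre = (acc + z * z / (2 * n)) / denom
    half = z * math.sqrt(acc * (1 - acc) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


# Synthetic example: 20 held-out pairs, 16 of which the rubric reward ranks correctly.
human = [1] * 10 + [-1] * 10
pred = [-1, -1] + [1] * 8 + [-1] * 8 + [1, 1]

acc = agreement(pred, human)
low, high = wilson_interval(acc, len(human))
above_chance = low > 0.5
print(round(acc, 2), above_chance)  # 0.8 True
```

The same two functions can be applied to a scalar reward model's predictions on the same pairs, which gives the comparison the paragraph above asks for.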

Figures

Figures reproduced from arXiv: 2605.17602 by Cho-Jui Hsieh, Daixuan Huo, Kuei-Chun Kao, Yuanhao Ban.

**Figure 1.** Reward hacking in scalar reward optimization. HPSv3 optimization attains a high scalar reward while violating prompt-specific constraints, whereas AutoRubric-T2I favors the rubric-aligned generation.

**Figure 2.** Overview of AutoRubric-T2I. Our framework first constructs a seed rubric pool through diversity-aware seed selection and rubric generation. It then iteratively scores training pairs, selects discriminative rubrics with sparse logistic regression, mines hard pairs, and proposes new rubrics to refine the final weighted rubric set.

**Figure 3.** Training dynamics of scalar and rubric-based T2I rewards.

**Figure 4.** The evolution of generation quality of RL using AutoRubrics and other scalar reward models. The visual quality of scalar reward models degrades notably while the reward increases.

**Figure 5.** Qualitative comparison of downstream T2I RL policies. AutoRubric-T2I better preserves prompt-specific objects, relations, and fine-grained details compared with the base model, scalar reward optimization, and AutoRule-based rubric rewards.

**Figure 6.** Qualitative examples from downstream RL fine-tuning with AutoRubric-T2I rewards. Each row shows a text prompt and the corresponding generated image, demonstrating improved prompt alignment, object placement, attribute accuracy, and overall visual quality after RL training.

**Figure 7.** Seed rubric generation, stage 1: vision reasoner that produces a step-by-step preference rationale for each image pair.

**Figure 8.** Seed rubric generation, stages 2-3: rule extractor and rule merger.

**Figure 9.** VLM judge templates: Yes/No binary scoring.

**Figure 10.** Hard-pair refinement prompt.

**Figure 11.** Screenshot of the human evaluation survey interface. Annotators were asked to choose the …

**Figure 12.** Optimized rubric set for Qwen-3-VL-8B trained on HPSv3 preference pairs (round 3).

**Figure 13.** Optimized rubric set for Qwen-3-VL-8B trained on PickScore preference pairs (round 6).

**Figure 14.** Optimized rubric set for Qwen-3-VL-32B trained on HPSv3 preference pairs (round 3).

**Figure 15.** Optimized rubric set for Qwen-3-VL-32B trained on PickScore preference pairs (round 6).

Original abstract

Aligning Text-to-Image (T2I) generation models with human preferences increasingly relies on image reward models that score or rank generated images according to prompt alignment and perceptual quality. Existing reward models are commonly trained as Bradley-Terry (BT) preference models on large-scale human preference corpora, making them costly to train, difficult to adapt, and opaque in their evaluation criteria. Meanwhile, Vision-Language Model (VLM) judges can provide more fine-grained assessments through textual rubrics, but their manually designed or heuristically generated scoring rules may fail to reliably reflect human preferences. In this paper, we propose AutoRubric-T2I, the first rubric learning framework in T2I that automatically synthesizes and selects explicit rubrics for guiding VLM judges. AutoRubric-T2I first synthesizes reasoning traces from preference pairs into candidate rubrics, then uses a VLM judge to score paired images under each rubric, producing pairwise rubric-score differences for preference learning. To remove noisy and redundant rules, we further employ an $\ell_1$-Regularized Logistic Regression Refiner, which selects the Top-$N$ most discriminative rubrics. Extensive evaluations show that AutoRubric-T2I produces high-quality, interpretable reward signals using less than 0.01% of the annotated preference data, substantially reducing the need for large-scale reward-model training. On image reward benchmarks such as MMRB2, AutoRubric-T2I outperforms strong reward model baselines. We further validate AutoRubric-T2I as an RL reward on downstream T2I tasks, including TIIF and UniGenBench++, where it improves generation quality over scalar reward models using the Flow-GRPO pipeline on diffusion models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

AutoRubric-T2I auto-generates and prunes rubrics for VLM judges in T2I alignment, cutting data use to under 0.01% while claiming better results than standard reward models.

The main point is a pipeline that pulls reasoning traces from a small set of preference pairs, turns them into candidate rubrics, scores image pairs with a VLM, and then uses L1-regularized logistic regression to keep only the top N most useful ones. This produces an explicit, interpretable reward signal instead of a black-box Bradley-Terry model. They report it matches or beats baselines on MMRB2 and improves downstream RL generation on tasks like TIIF with far less human data.

That combination of automatic rubric creation and selection is the actual novelty here, and it directly addresses the opacity and data hunger of current T2I reward models. The interpretability angle is useful for anyone who needs to debug or adapt the scoring rules later. The data reduction claim, if it holds, would matter for practical work where large preference corpora are expensive.

The soft spot is the circularity the stress-test flags. Rubric synthesis and the regression target both come from the same tiny preference slice, so the L1 step could latch onto VLM-specific artifacts or spurious patterns that do not generalize. The abstract does not mention held-out rubric validation or causal checks, so the full paper needs to show that the selected rubrics still work on fresh data and that the gains are not just in-sample. Without those controls the performance numbers are harder to trust.

This is for people building or tuning reward models for text-to-image and similar generative tasks who want something more transparent than scalar BT models. Readers working on efficient alignment or interpretable evaluation will get the most out of it. The core idea is coherent and the claims are testable, so it deserves a serious referee even if the experiments need tightening.

Referee Report

2 major / 2 minor

Summary. The paper presents AutoRubric-T2I, a framework for automatically learning explicit rubrics to guide Vision-Language Model (VLM) judges in evaluating text-to-image (T2I) generations. It synthesizes reasoning traces from a small subset (<0.01%) of human preference pairs into candidate rubrics, scores paired images using a VLM under each rubric, and applies an ℓ1-regularized logistic regression to select the top-N most discriminative rubrics. The approach claims to produce high-quality, interpretable reward signals that outperform strong baselines on benchmarks like MMRB2 while improving downstream T2I generation quality in RL fine-tuning, all with drastically reduced data requirements.

Significance. If validated, this work offers a significant advancement in making reward models for T2I alignment more data-efficient, interpretable, and adaptable compared to traditional Bradley-Terry models trained on large corpora. By leveraging VLM judges with learned rubrics, it addresses opacity in existing reward models and reduces the cost of large-scale preference data collection. The potential for rule-based, human-aligned evaluation could influence future work in multimodal alignment and RL for generative models.

major comments (2)

[§3.2–3.3] (rubric scoring and L1 refiner): The logistic regression is fit directly to pairwise score differences computed from the identical small preference subset used to synthesize the candidate rubrics in §3.1. This creates a dependence that risks selecting rubrics exploiting VLM-specific artifacts or in-sample correlations rather than generalizable rules; no held-out rubric validation or causal test is described to rule out overfitting.
[§4] (experiments): Claims of outperforming baselines on MMRB2 and improving downstream RL tasks (TIIF, UniGenBench++) with <0.01% data are stated without reported numerical values, error bars, ablation on Top-N or regularization strength, or controls for systematic bias in the VLM judge itself. These omissions make it impossible to verify that the reported gains are robust rather than artifacts of the synthesis set.

minor comments (2)

[§3.3] The free parameters (Top-N, L1 strength) are mentioned but their selection procedure or sensitivity analysis is not detailed; add a short paragraph or table showing how they were chosen.
[§3.2] Notation for rubric-score differences and the logistic target should be formalized with an equation to improve clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. We address each major comment below and have revised the manuscript to strengthen the presentation of our method and results.

Referee: [§3.2–3.3] (rubric scoring and L1 refiner): The logistic regression is fit directly to pairwise score differences computed from the identical small preference subset used to synthesize the candidate rubrics in §3.1. This creates a dependence that risks selecting rubrics exploiting VLM-specific artifacts or in-sample correlations rather than generalizable rules; no held-out rubric validation or causal test is described to rule out overfitting.

Authors: We agree that fitting the L1-regularized logistic regression on the same small subset used for rubric synthesis introduces a risk of selecting rules that capture in-sample correlations or VLM-specific patterns. The synthesis step generates candidate rubrics from reasoning traces, but the subsequent scoring and selection occur on the identical pairs. To address this, we have added a held-out validation procedure in the revised §3.3: after rubric selection on the synthesis subset, we evaluate the selected rubrics on a disjoint held-out portion of the preference data and report the resulting preference prediction accuracy. We also include a brief causal-style check by measuring rubric stability across different random splits of the synthesis set. These additions are now described in the revised manuscript.

revision: yes
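
The response does not say how rubric stability across splits is measured. One plausible reading, offered here only as an assumption, is the mean pairwise Jaccard overlap between the rubric sets selected on each split; the index sets below are invented for illustration.

```python
from itertools import combinations
from typing import List, Set


def jaccard(a: Set[int], b: Set[int]) -> float:
    """Overlap of two selected-rubric index sets, 1.0 when identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def mean_pairwise_jaccard(selections: List[Set[int]]) -> float:
    """Average Jaccard overlap over all pairs of splits."""
    pairs = list(combinations(selections, 2))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Hypothetical rubric indices selected on three random splits of the synthesis set.
selections = [{0, 3, 5}, {0, 3, 7}, {0, 3, 5}]
stability = mean_pairwise_jaccard(selections)
print(round(stability, 3))  # 0.667
```

Values near 1 would indicate that the same rubrics survive regardless of the split; values near 0 would suggest the selection is driven by sample noise.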

Referee: [§4] (experiments): Claims of outperforming baselines on MMRB2 and improving downstream RL tasks (TIIF, UniGenBench++) with <0.01% data are stated without reported numerical values, error bars, ablation on Top-N or regularization strength, or controls for systematic bias in the VLM judge itself. These omissions make it impossible to verify that the reported gains are robust rather than artifacts of the synthesis set.

Authors: The original submission emphasized relative improvements and data efficiency but did not include the full set of quantitative results, error bars, or ablations requested. We have revised §4 to include: (i) exact numerical scores and standard deviations on MMRB2 for AutoRubric-T2I versus the strongest baselines, (ii) ablation tables varying Top-N and the L1 regularization coefficient, (iii) error bars across three random seeds for both reward-model and downstream RL experiments, and (iv) a control experiment that replaces the learned rubrics with a fixed generic VLM prompt to isolate the contribution of the selected rubrics from any systematic VLM bias. These results are now reported with the corresponding tables and figures in the revised manuscript.

revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical pipeline with external validation

The paper presents an empirical framework that synthesizes candidate rubrics from a small subset of preference pairs, scores them via an external VLM, and applies L1-regularized logistic regression for selection. This process is a standard data-driven feature selection step within a proposed method, not a first-principles derivation or prediction that reduces to its inputs by construction. Performance is assessed on separate benchmarks (MMRB2) and downstream RL tasks, with no load-bearing self-citations, uniqueness theorems, or self-definitional equations identified in the described chain. The approach remains self-contained against external evaluation.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

Information is limited to the abstract; the method relies on a small number of hyperparameters for the refiner and on the domain assumption that VLM rubric scoring tracks human preference.

free parameters (2)

Top-N
Number of rubrics retained after L1-regularized logistic regression; chosen to balance discriminativeness and noise.
L1 regularization strength
Controls sparsity in rubric selection; fitted or tuned on the preference-derived score differences.
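
To show why the regularization strength acts as a sparsity knob, the short sketch below applies the soft-thresholding operator, the shrinkage step behind L1 penalties, to a fixed weight vector at several thresholds. The weights and thresholds are arbitrary illustrative numbers, not values from the paper.

```python
from typing import List


def soft_threshold(x: float, t: float) -> float:
    """Shrink x toward zero by t; values with |x| <= t become exactly 0."""
    if x > t:
        return x - t
    if x < -t:
        return x + t
    return 0.0


# Hypothetical unregularized rubric weights.
weights: List[float] = [1.2, -0.4, 0.05, -0.9, 0.2]

counts = []
for lam in (0.0, 0.1, 0.5, 1.0):
    shrunk = [soft_threshold(w, lam) for w in weights]
    counts.append(sum(1 for s in shrunk if s != 0.0))

print(counts)  # [5, 4, 2, 1]
```

A larger threshold leaves fewer rubrics with non-zero weight, so the penalty strength and Top-N interact: a strong penalty can leave fewer than N survivors.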

axioms (1)

Domain assumption: Reasoning traces extracted from preference pairs can be converted into explicit rubrics that, when scored by a VLM, produce differences correlated with the original human judgments.
This premise is required for the synthesis step to produce useful candidate rubrics.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

$\min_{w} \lambda \lVert w \rVert_1 + \sum_{i} \log \sigma\big(z^{(i)} \sum_{j} w_j \Delta s^{(i)}_j\big)$, solved by block coordinate descent with hard-pair mining
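
For readers who want the objective spelled out, the block below writes it with its gradient and optimality conditions. Two things are assumptions rather than facts from the source: the loss is written as a negative log-likelihood (the quoted passage shows no sign, but a minimization needs one), and $z^{(i)} \in \{+1, -1\}$ is read as the human preference label for pair $i$, with $\Delta s^{(i)}_j$ the score difference under rubric $j$.

```latex
% Assumed sign convention: negative log-likelihood, so the problem is a minimization.
\min_{w}\; F(w) = \lambda \lVert w \rVert_1 - \sum_{i} \log \sigma\big(m_i(w)\big),
\qquad m_i(w) = z^{(i)} \sum_{j} w_j\, \Delta s^{(i)}_j,
\qquad \sigma(t) = \frac{1}{1 + e^{-t}}

% Gradient of the smooth part with respect to one rubric weight:
g_j(w) = -\sum_{i} \big(1 - \sigma(m_i(w))\big)\, z^{(i)}\, \Delta s^{(i)}_j

% Optimality (subgradient) conditions for the \ell_1 term:
w_j \neq 0 \;\Rightarrow\; g_j(w) = -\lambda\, \operatorname{sign}(w_j),
\qquad
w_j = 0 \;\Rightarrow\; \lvert g_j(w) \rvert \le \lambda
```

Under these conditions a rubric keeps a non-zero weight only if its gradient magnitude reaches $\lambda$, and the factor $1 - \sigma(m_i)$ is largest on pairs the current rubric set ranks poorly. That is one way to read why mining hard pairs is a natural source of new candidate rubrics, though this interpretation is not stated in the quoted passage.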

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

39 extracted references · 39 canonical work pages · 9 internal anchors

[1]

Training Diffusion Models with Reinforcement Learning

Kevin Black, Michael Janner, Yilun Du, Ilya Kostrikov, and Sergey Levine. Training diffusion models with reinforcement learning.arXiv preprint arXiv:2305.13301, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[2]

Odin: Disentangled reward mitigates hacking in rlhf.arXiv preprint arXiv:2402.07319, 2024

Lichang Chen, Chen Zhu, Davit Soselia, Jiuhai Chen, Tianyi Zhou, Tom Goldstein, Heng Huang, Mehrdad Farajtabar, and Hongyang Li. Odin: Disentangled reward mitigates hacking in rlhf.arXiv preprint arXiv:2402.07319, 2024

work page arXiv 2024
[3]

Scaling rectified flow trans- formers for high-resolution image synthesis

Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow trans- formers for high-resolution image synthesis. InForty-first international conference on machine learning, 2024

work page 2024
[4]

Rubricrl: Simple generalizable rewards for text-to-image generation.arXiv preprint arXiv:2511.20651, 2025

Xuelu Feng, Yunsheng Li, Ziyu Wan, Zixuan Gao, Junsong Yuan, Dongdong Chen, and Chunming Qiao. Rubricrl: Simple generalizable rewards for text-to-image generation.arXiv preprint arXiv:2511.20651, 2025

work page arXiv 2025
[5]

Gemini 3 system card

Google DeepMind. Gemini 3 system card. https://deepmind.google/technologies/ gemini/, 2025. Accessed: 2026-04-23

work page 2025
[6]

Rubrics as Rewards: Reinforcement Learning Beyond Verifiable Domains

Anisha Gunjal, Anthony Wang, Elaine Lau, Vaskar Nath, Yunzhong He, Bing Liu, and Sean Hendryx. Rubrics as rewards: Reinforcement learning beyond verifiable domains.arXiv preprint arXiv:2507.17746, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[7]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Runxin Zhang, Runze Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[8]

Understanding reward hacking in text-to-image reinforcement learning.arXiv preprint arXiv:2601.03468, 2026

Yunqi Hong, Kuei-Chun Kao, Hengguang Zhou, and Cho-Jui Hsieh. Understanding reward hacking in text-to-image reinforcement learning.arXiv preprint arXiv:2601.03468, 2026

work page arXiv 2026
[9]

Yushi Hu, Benlin Liu, Jungo Kasai, Yizhong Wang, Mari Ostendorf, Ranjay Krishna, and Noah A. Smith. Tifa: Accurate and interpretable text-to-image faithfulness evaluation with question answering.arXiv preprint arXiv:2303.11897, 2023

work page arXiv 2023
[10]

Multimodal rewardbench 2: Evaluating omni reward models for interleaved text and image.arXiv preprint arXiv:2512.16899, 2025

Yushi Hu, Reyhane Askari-Hemmat, Melissa Hall, Emily Dinan, Luke Zettlemoyer, and Marjan Ghazvininejad. Multimodal rewardbench 2: Evaluating omni reward models for interleaved text and image.arXiv preprint arXiv:2512.16899, 2025

work page arXiv 2025
[11]

Reinforcement learning with rubric anchors

Zenan Huang, Yihong Zhuang, Guoshan Lu, Zeyu Qin, Haokai Xu, Tianyu Zhao, Ru Peng, Jiaqi Hu, Zhanming Shen, Xiaomeng Hu, et al. Reinforcement learning with rubric anchors. arXiv preprint arXiv:2508.12790, 2025

work page arXiv 2025
[12]

Orthogonal matching pursuit with replacement

Prateek Jain, Ambuj Tewari, and Inderjit Dhillon. Orthogonal matching pursuit with replacement. Advances in neural information processing systems, 24, 2011. 10

work page 2011
[13]

Genai arena: An open evaluation platform for generative models.Advances in Neural Information Processing Systems, 37:79889–79908, 2024

Dongfu Jiang, Max Ku, Tianle Li, Yuansheng Ni, Shizhuo Sun, Rongqi Fan, and Wenhu Chen. Genai arena: An open evaluation platform for generative models.Advances in Neural Information Processing Systems, 37:79889–79908, 2024

work page 2024
[14]

T2i-r1: Reinforcing image generation with collaborative semantic-level and token-level cot.arXiv preprint arXiv:2505.00703, 2025

Dongzhi Jiang, Ziyu Guo, Renrui Zhang, Zhuofan Zong, Hao Li, Le Zhuo, Shilin Yan, Pheng- Ann Heng, and Hongsheng Li. T2i-r1: Reinforcing image generation with collaborative semantic-level and token-level cot.arXiv preprint arXiv:2505.00703, 2025

work page arXiv 2025
[15]

Pick-a-pic: An open dataset of user preferences for text-to-image generation.Advances in neural information processing systems, 36:36652–36663, 2023

Yuval Kirstain, Adam Polyak, Uriel Singer, Shahbuland Matiana, Joe Penna, and Omer Levy. Pick-a-pic: An open dataset of user preferences for text-to-image generation.Advances in neural information processing systems, 36:36652–36663, 2023

work page 2023
[16]

Genai-bench: Evaluating and improving compositional text-to-visual generation.arXiv preprint arXiv:2406.13743, 2024

Baiqi Li, Zhiqiu Lin, Deepak Pathak, Jiayao Li, Yixin Fei, Kewen Wu, Tiffany Ling, Xide Xia, Pengchuan Zhang, Graham Neubig, and Deva Ramanan. Genai-bench: Evaluating and improving compositional text-to-visual generation.arXiv preprint arXiv:2406.13743, 2024

work page arXiv 2024
[17]

Rubrichub: A comprehensive and highly discriminative rubric dataset via automated coarse-to-fine generation.arXiv preprint arXiv:2601.08430, 2026

Sunzhu Li, Jiale Zhao, Miteto Wei, Huimin Ren, Yang Zhou, Jingwen Yang, Shunyu Liu, Kaike Zhang, and Wei Chen. Rubrichub: A comprehensive and highly discriminative rubric dataset via automated coarse-to-fine generation.arXiv preprint arXiv:2601.08430, 2026

work page arXiv 2026
[18]

Flow-GRPO: Training Flow Matching Models via Online RL

Jie Liu, Gongye Liu, Jiajun Liang, Yangguang Li, Jiaheng Liu, Xintao Wang, Pengfei Wan, Di Zhang, and Wanli Ouyang. Flow-grpo: Training flow matching models via online rl.arXiv preprint arXiv:2505.05470, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[19]

Openrubrics: Contrastive rubric generation for reward models.arXiv preprint arXiv:2505.14826, 2025

Zhen Liu, Yixin Wang, Jianfei Chen, and Jun Zhu. Openrubrics: Contrastive rubric generation for reward models.arXiv preprint arXiv:2505.14826, 2025

work page arXiv 2025
[20]

Hpsv3: Towards wide-spectrum hu- man preference score

Yuhang Ma, Xiaoshi Wu, Keqiang Sun, and Hongsheng Li. Hpsv3: Towards wide-spectrum hu- man preference score. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 15086–15095, 2025

work page 2025
[21]

Or- thogonal matching pursuit: Recursive function approximation with applications to wavelet decomposition

Yagyensh Chandra Pati, Ramin Rezaiifar, and Perinkulam Sambamurthy Krishnaprasad. Or- thogonal matching pursuit: Recursive function approximation with applications to wavelet decomposition. InProceedings of 27th Asilomar conference on signals, systems and computers, pages 40–44. IEEE, 1993

work page 1993
[22]

Onlinerubrics: Dynamic rubric elicitation for online reinforcement learning.arXiv preprint arXiv:2507.09832, 2025

Keivan Rezaei, Xuechen He, and Percy Liang. Onlinerubrics: Dynamic rubric elicitation for online reinforcement learning.arXiv preprint arXiv:2507.09832, 2025

work page arXiv 2025
[23]

Rrd: Recursive rubric decomposition for scalable reward modeling.arXiv preprint arXiv:2601.05743, 2026

Yifan Shen, Xiang Li, Wei Zhang, and Yang Liu. Rrd: Recursive rubric decomposition for scalable reward modeling.arXiv preprint arXiv:2601.05743, 2026

work page arXiv 2026
[24]

Autorule: Reasoning chain-of-thought extracted rule-based rewards improve preference learning.arXiv preprint arXiv:2506.15651, 2025

Tevin Wang and Chenyan Xiong. Autorule: Reasoning chain-of-thought extracted rule-based rewards improve preference learning.arXiv preprint arXiv:2506.15651, 2025

work page arXiv 2025
[25]

Unigenbench++: A unified semantic evaluation benchmark for text-to-image generation.arXiv preprint arXiv:2510.18701, 2025

Yibin Wang, Zhimin Li, Yuhang Zang, Jiazi Bu, Yujie Zhou, Yi Xin, Junjun He, Chunyu Wang, Qinglin Lu, Cheng Jin, et al. Unigenbench++: A unified semantic evaluation benchmark for text-to-image generation.arXiv preprint arXiv:2510.18701, 2025

work page arXiv 2025
[26]

Unified Reward Model for Multimodal Understanding and Generation

Yibin Wang, Yuhang Zang, Hao Li, Cheng Jin, and Jiaqi Wang. Unified reward model for multimodal understanding and generation.arXiv preprint arXiv:2503.05236, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[27]

Tiif-bench: How does your t2i model follow your instructions?arXiv preprint arXiv:2506.02161, 2025

Xinyu Wei, Jinrui Zhang, Zeqing Wang, Hongyang Wei, Zhen Guo, and Lei Zhang. Tiif-bench: How does your t2i model follow your instructions?arXiv preprint arXiv:2506.02161, 2025

work page arXiv 2025
[28]

Reinforcement Learning with Verifiable Rewards Implicitly Incentivizes Correct Reasoning in Base LLMs

Xumeng Wen, Zihan Liu, Shun Zheng, Shengyu Ye, Zhirong Wu, Yang Wang, Zhijian Xu, Xiao Liang, Junjie Li, Ziming Miao, et al. Reinforcement learning with verifiable rewards implicitly incentivizes correct reasoning in base llms.arXiv preprint arXiv:2506.14245, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[29]

Human Preference Score v2: A Solid Benchmark for Evaluating Human Preferences of Text-to-Image Synthesis

Xiaoshi Wu, Yiming Hao, Keqiang Sun, Yixiong Chen, Feng Zhu, Rui Zhao, and Hongsheng Li. Human preference score v2: A solid benchmark for evaluating human preferences of text-to-image synthesis.arXiv preprint arXiv:2306.09341, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[30]

Rewarddance: Scaling visual reward modeling via generative next-token prediction.arXiv preprint arXiv:2504.12345, 2025

Xiaoshi Wu, Yiming Li, Keqiang Zhang, and Hongsheng Li. Rewarddance: Scaling visual reward modeling via generative next-token prediction.arXiv preprint arXiv:2504.12345, 2025. 11

work page arXiv 2025
[31]

Auto-rubric: Learning from implicit weights to explicit rubrics for reward modeling.arXiv preprint arXiv:2510.17314, 2025

Lipeng Xie, Sen Huang, Zhuo Zhang, Anni Zou, Yunpeng Zhai, Dingchao Ren, Kezun Zhang, Haoyuan Hu, Boyin Liu, Haoran Chen, et al. Auto-rubric: Learning from implicit weights to explicit rubrics for reward modeling.arXiv preprint arXiv:2510.17314, 2025

work page arXiv 2025
[32]

Imagereward: Learning and evaluating human preferences for text-to-image generation

Jiazheng Xu, Xiao Liu, Yuchen Wu, Yuxuan Tong, Qinkai Li, Ming Ding, Jie Tang, and Yuxiao Dong. Imagereward: Learning and evaluating human preferences for text-to-image generation. Advances in Neural Information Processing Systems, 36:15903–15935, 2023

work page 2023
[33]

Alternating reinforcement learning for rubric-based reward modeling in non-verifiable llm post-training.arXiv preprint arXiv:2602.01511, 2026

Ran Xu, Tianci Liu, Zihan Dong, Tony Yu, Ilgee Hong, Carl Yang, Linjun Zhang, Tao Zhao, and Haoyu Wang. Alternating reinforcement learning for rubric-based reward modeling in non-verifiable llm post-training.arXiv preprint arXiv:2602.01511, 2026

work page arXiv 2026
[34] Zeyue Xue, Jie Wu, Yu Gao, Fangyuan Kong, Lingting Zhu, Mengzhao Chen, Zhiheng Liu, Wei Liu, Qiushan Guo, Weilin Huang, et al. DanceGRPO: Unleashing GRPO on visual generation. arXiv preprint arXiv:2505.07818, 2025.
[35] An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025.
[36] Yongjin Yang, Sihyeon Kim, Hojung Jung, Sangmin Bae, SangMook Kim, Se-Young Yun, and Kimin Lee. Automated filtering of human feedback data for aligning text-to-image diffusion models. arXiv preprint arXiv:2410.10166, 2024.
[37] Ian E Yen, Ting-Wei Lin, Shou-De Lin, Pradeep Ravikumar, and Inderjit S Dhillon. Sparse random feature algorithm as coordinate descent in Hilbert space. Advances in Neural Information Processing Systems, 27, 2014.
[38] Junkai Zhang, Zihao Wang, Lin Gui, Swarnashree Mysore Sathyendra, Jaehwan Jeong, Victor Veitch, Wei Wang, Yunzhong He, Bing Liu, and Lifeng Jin. Chasing the tail: Effective rubric-based reward modeling for large language model post-training. arXiv preprint arXiv:2509.21500, 2026.
[39]

MMRB2 [10] is a recent omni reward-model benchmark covering four subtasks (text-to-image, image editing, interleaved generation, and multimodal reasoning, i.e. “thinking-with-images”), with 1,000 expert-annotated preference pairs per subtask drawn from 23 frontier models across 21 source tasks.

Generative T2I Benchmarks. For T2I generative quality assessment on RL post-tra...
