pith. machine review for the scientific record.

arxiv: 2605.08354 · v1 · submitted 2026-05-08 · 💻 cs.AI

Recognition: 2 theorem links


Auto-Rubric as Reward: From Implicit Preferences to Explicit Multimodal Generative Criteria

Authors on Pith · no claims yet

Pith reviewed 2026-05-12 00:55 UTC · model grok-4.3

classification 💻 cs.AI
keywords multimodal alignment · rubrics as reward · reward modeling · RLHF · text-to-image generation · preference learning · vision-language models

The pith

Turning a vision-language model's hidden preferences into explicit prompt-specific rubrics before any comparison produces more reliable and data-efficient rewards for multimodal generators.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that standard reward approaches for aligning image generators collapse human judgment into single numbers or direct comparisons, losing structure and inviting bias. It proposes first extracting structured rubrics directly from the model itself for each prompt, turning overall intent into separate, checkable quality dimensions. This externalization step is presented as the key move that cuts evaluation biases such as positional favoritism and supports both immediate use and quick adaptation with little data. If the claim holds, alignment improves because the interface between knowledge and reward becomes explicit and factorized rather than because more preference data is collected.

Core claim

ARR externalizes a VLM's internalized preference knowledge as prompt-specific rubrics, translating holistic intent into independently verifiable quality dimensions before any pairwise comparison. This conversion of implicit preference structure into inspectable, interpretable constraints substantially suppresses evaluation biases including positional bias, enabling both zero-shot deployment and few-shot conditioning on minimal supervision. Rubric Policy Optimization then distills the structured evaluation into a robust binary reward that replaces opaque scalar regression with rubric-conditioned preference decisions for stable policy gradients.
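The paper's problem setup (its Section 3.1) formulates preference learning as estimating a probabilistic model P_θ that, given a prompt x and candidate outputs y+, y−, assigns higher likelihood to the response better satisfying human intent. A sketch of how the core claim reads in that notation; the rubric-conditioned form is a hedged paraphrase of "independently verifiable quality dimensions", not an equation the paper displays:

```latex
% The paper's setup: the model should prefer the better response.
P_\theta\!\left(y^{+} \succ y^{-} \mid x\right) > \tfrac{1}{2}

% ARR's externalization step, paraphrased: a rubric generator R maps the
% prompt to explicit criteria, and the pairwise judgment conditions on all
% of them jointly instead of on an opaque scalar score.
R(x) = \{c_1, \ldots, c_K\}, \qquad
P_\theta\!\left(y^{+} \succ y^{-} \mid x,\, R(x)\right)
```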

What carries the argument

The Auto-Rubric as Reward process, which generates explicit, prompt-specific rubrics from a VLM prior to any pairwise comparison and thereby acts as a factorized interface converting implicit preferences into verifiable quality dimensions.
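As a concrete reading of this two-stage interface, here is a minimal sketch in Python, assuming a generic chat-style VLM client with vision input; the helper names (`generate_rubrics`, `rubric_conditioned_judge`) and prompt wording are illustrative, not the paper's released code:

```python
# Minimal sketch of the ARR interface described above. The `vlm` client and
# its `complete` method are hypothetical stand-ins for any VLM API.

def generate_rubrics(vlm, prompt: str, k: int = 5) -> list[str]:
    """Stage 1: externalize the VLM's implicit preferences as explicit,
    prompt-specific criteria BEFORE any candidate is compared."""
    instruction = (
        f"List {k} independently verifiable quality criteria for judging "
        f"an image generated from this prompt:\n{prompt}"
    )
    return vlm.complete(instruction).splitlines()[:k]


def rubric_conditioned_judge(vlm, prompt: str, image_a, image_b,
                             rubrics: list[str]) -> int:
    """Stage 2: a single pairwise decision conditioned on all rubrics
    jointly, preserving inter-criterion dependencies rather than scoring
    each criterion independently and aggregating post hoc.
    Returns 0 if image_a is preferred, 1 otherwise."""
    instruction = (
        "Decide which image better satisfies every criterion below. "
        "Ignore the order in which the images are presented.\n"
        + "\n".join(rubrics)
    )
    verdict = vlm.complete(instruction, images=[image_a, image_b])
    return 0 if "first" in verdict.lower() else 1
```

On this reading, the zero-shot/few-shot distinction the paper draws reduces to whether `generate_rubrics` is called bare or conditioned on a handful of preference pairs.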

If this is right

  • ARR-RPO outperforms pairwise reward models and direct VLM judges on text-to-image generation and image editing benchmarks.
  • The method enables zero-shot deployment without additional training and few-shot adaptation using minimal supervision.
  • Explicit rubrics suppress positional bias and other evaluation artifacts that affect direct comparison methods.
  • Replacing scalar regression with rubric-conditioned binary decisions stabilizes policy gradients during generative training (a sketch of this reward shape follows this list).
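A minimal sketch of what the last point could mean in practice, assuming a REINFORCE-style estimator; the `policy.sample_with_logprobs` and `judge` interfaces are hypothetical, and this is one reading of RPO's binary reward shape, not the paper's training code:

```python
import torch

def rpo_style_update(policy, optimizer, prompts, judge, reference_images):
    """Hypothetical policy-gradient step where each sample earns a binary,
    rubric-conditioned reward instead of a regressed scalar score."""
    images, logprobs = policy.sample_with_logprobs(prompts)
    # 1.0 if the fresh sample beats the reference under the jointly
    # applied rubrics (judge returns 0 when the first image wins).
    rewards = torch.tensor([
        1.0 if judge(p, img, ref) == 0 else 0.0
        for p, img, ref in zip(prompts, images, reference_images)
    ])
    advantages = rewards - rewards.mean()   # simple mean baseline
    loss = -(advantages * logprobs).mean()  # REINFORCE objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

Bounding the reward in {0, 1} caps the per-sample gradient magnitude, which is one plausible mechanism behind the claimed stability.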

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same externalization step could be applied to other generative tasks such as video or 3D synthesis if the underlying VLM supports those domains.
  • If rubrics generated by different VLMs converge on similar criteria for the same prompts, they could serve as a shared, inspectable standard for evaluation.
  • Comparing ARR rubrics against human-written rubrics on the same prompts would test whether the externalization step preserves or alters preference structure (a sketch of such a comparison follows this list).
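A minimal sketch of the rubric-comparison idea in the last bullet, using only the standard library; string similarity is a crude stand-in for the semantic matching (e.g., an embedding model) a real study would need:

```python
from difflib import SequenceMatcher

def rubric_agreement(rubrics_a: list[str], rubrics_b: list[str]) -> float:
    """Greedily match each criterion in rubrics_a to its most similar
    unmatched criterion in rubrics_b and return the mean similarity in
    [0, 1]. Applied to ARR-generated vs. human-written rubrics for the
    same prompt, a low score would flag altered preference structure."""
    remaining = list(rubrics_b)
    similarities = []
    for criterion in rubrics_a:
        if not remaining:
            break
        best = max(remaining,
                   key=lambda r: SequenceMatcher(None, criterion, r).ratio())
        similarities.append(SequenceMatcher(None, criterion, best).ratio())
        remaining.remove(best)
    return sum(similarities) / len(similarities) if similarities else 0.0
```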

Load-bearing premise

A vision-language model can reliably turn its own internalized preferences into prompt-specific rubrics that remain independently verifiable and free of the evaluation biases the approach aims to remove.

What would settle it

Human raters judge that image generations selected using ARR rubrics are no better aligned with preferences than generations selected by standard pairwise reward models on the same text-to-image and image-editing benchmarks.

Figures

Figures reproduced from arXiv: 2605.08354 by Fengyuan Liu, Furong Xu, Haodong Li, Jiaming Han, Juanxi Tian, Wanhua Li, Yesheng Liu, Yilei Jiang, Yongliang Wu.

Figure 1: Overview of the ARR-RPO framework.
Figure 2: Performance comparison of ARR-RPO variants against specialist models across text-to-image generation (top) and image editing (bottom) benchmarks.
Figure 3: Text-to-image and image editing examples (ARR-RPO with Gemini 3.1 Pro). ARR is instantiated with three VLMs, Qwen3-VL-8B [2], GPT-5 [33], and Gemini 3.1 Pro [12], to examine how rubric quality scales with judge capability.
Figure 4: Ablation studies on ARR. (a) Forward–reverse preference gaps across evaluators. (b) Cross-model rubric transfer with a fixed judge.
Figure 5: Examples of text-to-image generation.
Figure 6: Examples of image editing.
Figure 7: Auto-generated T2I rubrics (Gemini 3.1 Pro). Example prompt-conditioned rubrics automatically synthesized by ARR for text-to-image evaluation, spanning dimensions such as architectural fidelity, lighting consistency, texture realism, and AI artifact detection.
Figure 8: T2I evaluation system prompt. The prompt template used to instruct the VLM judge to perform pairwise comparison for text-to-image generation, including task description, output format requirements, and anti-position-bias reminders.
Figure 9: Auto-generated image editing rubrics (Gemini 3.1 Pro). Example prompt-conditioned rubrics automatically synthesized by ARR for image editing evaluation, covering fidelity preservation, material integrity, lighting consistency, and artifact elimination.
Figure 10: Image editing evaluation system prompt. The prompt template used to instruct the VLM judge to perform pairwise comparison for image editing, where Image BASE serves as the ground-truth reference for fidelity assessment.
Original abstract

Aligning multimodal generative models with human preferences demands reward signals that respect the compositional, multi-dimensional structure of human judgment. Prevailing RLHF approaches reduce this structure to scalar or pairwise labels, collapsing nuanced preferences into opaque parametric proxies and exposing vulnerabilities to reward hacking. While recent Rubrics-as-Reward (RaR) methods attempt to recover this structure through explicit criteria, generating rubrics that are simultaneously reliable, scalable, and data-efficient remains an open problem. We introduce Auto-Rubric as Reward (ARR), a framework that reframes reward modeling from implicit weight optimization to explicit, criteria-based decomposition. Before any pairwise comparison, ARR externalizes a VLM's internalized preference knowledge as prompt-specific rubrics, translating holistic intent into independently verifiable quality dimensions. This conversion of implicit preference structure into inspectable, interpretable constraints substantially suppresses evaluation biases including positional bias, enabling both zero-shot deployment and few-shot conditioning on minimal supervision. To extend these gains into generative training, we propose Rubric Policy Optimization (RPO), which distills ARR's structured multi-dimensional evaluation into a robust binary reward, replacing opaque scalar regression with rubric-conditioned preference decisions that stabilize policy gradients. On text-to-image generation and image editing benchmarks, ARR-RPO outperforms pairwise reward models and VLM judges, demonstrating that explicitly externalizing implicit preference knowledge into structured rubrics achieves more reliable, data-efficient multimodal alignment, revealing that the bottleneck is the absence of a factorized interface, not a deficit of knowledge.

Editorial analysis

A structured set of objections, weighed in public.

Referee report, simulated authors' rebuttal, circularity audit, and an axiom and free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces Auto-Rubric as Reward (ARR), a framework that externalizes a VLM's implicit preferences into prompt-specific, explicit rubrics before any pairwise comparison, converting holistic judgments into independently verifiable quality dimensions. It further proposes Rubric Policy Optimization (RPO) to distill these structured evaluations into a binary reward signal for generative policy training. The authors claim this approach substantially suppresses biases (e.g., positional bias), enables zero-shot and few-shot alignment, and outperforms pairwise reward models and direct VLM judges on text-to-image generation and image editing benchmarks, arguing that the alignment bottleneck is the lack of a factorized interface rather than insufficient knowledge.

Significance. If the empirical claims hold with proper validation, the work would be significant for multimodal RLHF by shifting from opaque scalar rewards to explicit, inspectable criteria, potentially mitigating reward hacking and improving data efficiency. The emphasis on pre-comparison rubric externalization offers a concrete mechanism for bias reduction that could generalize beyond current VLM judges.

major comments (2)
  1. [Abstract] Abstract: The central claim that ARR 'substantially suppresses evaluation biases including positional bias' and achieves benchmark outperformance is stated without any quantitative results, ablation studies, error analysis, or verification mechanism (e.g., human validation of rubric independence or inter-rubric consistency metrics). This evidence gap is load-bearing because the diagnosis that 'the bottleneck is the absence of a factorized interface' rests on the rubrics being independently verifiable and freer of the VLM's original biases.
  2. [Method] Method description (inferred from abstract and framework): Rubric generation is performed by the identical VLM whose preferences are being aligned, with no described external anchoring, human oversight, or post-generation consistency checks. If rubric creation inherits the same implicit weightings that produce positional or holistic biases in direct judgment, the subsequent RPO distillation cannot isolate or remove them; a concrete test (e.g., ablation comparing ARR rubrics against direct VLM scoring on bias-sensitive prompts) is required to substantiate independence.
minor comments (1)
  1. [Abstract] The abstract introduces several new terms (ARR, RPO) without immediate forward references to their formal definitions or pseudocode; adding a brief notation table or early equation block would improve readability.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive report. We address each major comment below with references to the manuscript content and indicate planned revisions where they strengthen clarity without altering the core claims or experiments.

Point-by-point responses
  1. Referee: [Abstract] Abstract: The central claim that ARR 'substantially suppresses evaluation biases including positional bias' and achieves benchmark outperformance is stated without any quantitative results, ablation studies, error analysis, or verification mechanism (e.g., human validation of rubric independence or inter-rubric consistency metrics). This evidence gap is load-bearing because the diagnosis that 'the bottleneck is the absence of a factorized interface' rests on the rubrics being independently verifiable and freer of the VLM's original biases.

    Authors: The abstract serves as a concise summary of contributions and high-level claims. Quantitative benchmark results demonstrating outperformance over pairwise reward models and direct VLM judges on text-to-image generation and image editing tasks are presented in Section 4, with specific metrics and comparisons. Ablation studies on bias suppression (including positional bias) and human validation of rubric quality, independence, and inter-rubric consistency appear in Sections 5.2 and 5.3. These sections directly support the factorized-interface diagnosis. We will revise the abstract to include brief quantitative highlights and explicit cross-references to these sections. revision: partial

  2. Referee: [Method] Method description (inferred from abstract and framework): Rubric generation is performed by the identical VLM whose preferences are being aligned, with no described external anchoring, human oversight, or post-generation consistency checks. If rubric creation inherits the same implicit weightings that produce positional or holistic biases in direct judgment, the subsequent RPO distillation cannot isolate or remove them; a concrete test (e.g., ablation comparing ARR rubrics against direct VLM scoring on bias-sensitive prompts) is required to substantiate independence.

    Authors: The ARR design intentionally uses the same VLM to externalize its internalized preferences into explicit, prompt-specific rubrics prior to any comparison or distillation. This externalization step converts holistic judgments into independently inspectable dimensions, which the manuscript shows reduces biases (as evidenced by superior performance versus direct VLM judges and pairwise models on bias-sensitive benchmarks). The subsequent RPO step further distills these into binary rewards. While the current manuscript does not include a dedicated ablation isolating rubric generation from direct VLM scoring, the overall empirical gains on zero-shot/few-shot alignment and bias metrics substantiate the approach. We will add a clarifying paragraph in the method section on this design rationale and include the requested ablation comparing ARR rubrics to direct VLM scoring on positional-bias prompts in the revision. revision: yes
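A minimal sketch of the positional-bias ablation the referee requests and the authors promise, assuming a pairwise `judge(prompt, img_a, img_b, rubrics=None)` callable like the one sketched earlier; the flip-rate metric is an illustrative choice, not the paper's protocol:

```python
def positional_flip_rate(judge, pairs, rubrics_by_prompt=None):
    """Fraction of pairs whose verdict flips when the two candidates are
    presented in swapped order (0.0 = perfectly order-invariant).
    Comparing the rate with rubrics=None against the rate with ARR
    rubrics would directly test the claimed bias suppression."""
    flips = 0
    for prompt, img_a, img_b in pairs:
        rubrics = (rubrics_by_prompt or {}).get(prompt)
        forward = judge(prompt, img_a, img_b, rubrics=rubrics)  # 0 -> img_a
        reverse = judge(prompt, img_b, img_a, rubrics=rubrics)  # 0 -> img_b
        # An order-invariant judge picks the same underlying image both
        # times, so the two verdicts should be complements (0 vs 1).
        if forward == reverse:
            flips += 1
    return flips / len(pairs)
```

This is also the quantity Figure 4(a)'s forward–reverse preference gaps appear to measure.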

Circularity Check

0 steps flagged

No circularity: new framework with independent empirical claims

Full rationale

The paper introduces ARR (externalizing VLM preferences into prompt-specific rubrics before comparison) and RPO (distilling those into binary rewards) as methodological contributions. The abstract and description frame bias suppression and improved alignment as outcomes of this explicit decomposition, validated on text-to-image and editing benchmarks. No equations, fitted parameters, or derivations are shown that reduce a claimed result to its own inputs by construction. No self-citations, uniqueness theorems, or ansatzes from prior author work are invoked to justify core steps. The central diagnosis (bottleneck is missing factorized interface) is presented as interpretive rather than proven by self-reference. This is a standard proposal of a new interface with external evaluation, scoring 0.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The central claim rests on the domain assumption that VLMs possess extractable, factorized preference knowledge that can be turned into verifiable rubrics without introducing new biases. No free parameters are quantified in the abstract; the two invented entities, ARR and RPO, are named methods rather than fitted quantities.

axioms (1)
  • domain assumption VLMs contain internalized preference knowledge that can be externalized into prompt-specific, independently verifiable quality dimensions before pairwise comparison.
    Invoked in the description of ARR converting implicit structure into explicit constraints.
invented entities (2)
  • Auto-Rubric as Reward (ARR) framework no independent evidence
    purpose: Externalize VLM preferences into rubrics for reward modeling
    New method introduced to reframe reward from implicit optimization to explicit criteria.
  • Rubric Policy Optimization (RPO) no independent evidence
    purpose: Distill structured rubric evaluation into binary rewards for policy training
    Proposed to stabilize gradients by replacing scalar regression with rubric-conditioned decisions.

pith-pipeline@v0.9.0 · 5594 in / 1452 out tokens · 48051 ms · 2026-05-12T00:55:37.046553+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith/Foundation/LogicAsFunctionalEquation.lean · SatisfiesLawsOfLogic · echoes

    ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

    ARR externalizes a VLM’s internalized preference knowledge as prompt-specific rubrics, translating holistic intent into independently verifiable quality dimensions. This conversion of implicit preference structure into inspectable, interpretable constraints substantially suppresses evaluation biases

  • IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · echoes

    ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

    reframes reward modeling from implicit weight optimization to explicit, criteria-based decomposition

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

57 extracted references · 57 canonical work pages · 15 internal anchors

  1. [1]

    Critique-Out-Loud Reward Models

    Zachary Ankner, Mansheej Paul, Brandon Cui, Jonathan D. Chang, and Prithviraj Ammanabrolu. Critique-out-loud reward models. arXiv preprint arXiv:2408.11791, 2024

  2. [2]

    Qwen3-VL Technical Report

    Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-VL technical report. arXiv preprint arXiv:2511.21631, 2025

  3. [3]

    Improving Image Generation with Better Captions

    James Betker, Gabriel Goh, Li Jing, Tim Brooks, Jianfeng Wang, Linjie Li, Long Ouyang, Juntang Zhuang, Joyce Lee, Yufei Guo, Wesam Manassra, Prafulla Dhariwal, Casey Chu, Yunxin Jiao, and Aditya Ramesh. Improving image generation with better captions. arXiv preprint arXiv:2310.07685, 2023

  4. [4]

    Training Diffusion Models with Reinforcement Learning

    Kevin Black, Michael Janner, Yilun Du, Ilya Kostrikov, and Sergey Levine. Training diffusion models with reinforcement learning. arXiv preprint arXiv:2305.13301, 2023

  5. [5]

    InstructPix2Pix: Learning to Follow Image Editing Instructions

    Tim Brooks, Aleksander Holynski, and Alexei A. Efros. InstructPix2Pix: Learning to follow image editing instructions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18392–18402, 2023

  6. [6]

    BLIP3-o: A Family of Fully Open Unified Multimodal Models - Architecture, Training and Dataset

    Jiuhai Chen, Zhiyang Xu, Xichen Pan, Yushi Hu, Can Qin, Tom Goldstein, Lifu Huang, Tianyi Zhou, Saining Xie, Silvio Savarese, et al. BLIP3-o: A family of fully open unified multimodal models - architecture, training and dataset. arXiv preprint arXiv:2505.09568, 2025

  7. [7]

    ShareGPT-4o-Image: Aligning Multimodal Models with GPT-4o-Level Image Generation

    Junying Chen, Zhenyang Cai, Pengcheng Chen, Shunian Chen, Ke Ji, Xidong Wang, Yunjin Yang, and Benyou Wang. ShareGPT-4o-Image: Aligning multimodal models with GPT-4o-level image generation. arXiv preprint arXiv:2506.18095, 2025

  8. [8]

    Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling

    Xiaokang Chen, Zhiyu Wu, Xingchao Liu, Zizheng Pan, Wen Liu, Zhenda Xie, Xingkai Yu, and Chong Ruan. Janus-Pro: Unified multimodal understanding and generation with data and model scaling. arXiv preprint arXiv:2501.17811, 2025

  9. [9]

    Emerging properties in unified multimodal pretraining, 2025

    Chaorui Deng, Deyao Zhu, Kunchang Li, Chenhui Gou, Feng Li, Zeyu Wang, Shu Zhong, Weihao Yu, Xiaonan Nie, Ziang Song, Guang Shi, and Haoqi Fan. Emerging properties in unified multimodal pretraining, 2025

  10. [10]

    DPOK: Reinforcement Learning for Fine-Tuning Text-to-Image Diffusion Models

    Ying Fan, Olivia Watkins, Yuqing Du, Hao Liu, Moonkyung Ryu, Craig Boutilier, Pieter Abbeel, Mohammad Ghavamzadeh, Kangwook Lee, and Kimin Lee. DPOK: Reinforcement learning for fine-tuning text-to-image diffusion models. Advances in Neural Information Processing Systems, 36:79858–79885, 2023

  11. [11]

    GenEval: An Object-Focused Framework for Evaluating Text-to-Image Alignment

    Dhruba Ghosh, Hannaneh Hajishirzi, and Ludwig Schmidt. GenEval: An object-focused framework for evaluating text-to-image alignment. Advances in Neural Information Processing Systems, 36:52132–52152, 2023

  12. [12]

    Gemini 3.1 Pro - Model Card

    Google DeepMind. Gemini 3.1 Pro - Model Card. https://deepmind.google/models/model-cards/gemini-3-1-pro/, February 2026

  13. [13]

    LLM-Rubric: A Multidimensional, Calibrated Approach to Automated Evaluation of Natural Language Texts

    Helia Hashemi, Jason Eisner, Corby Rosset, Benjamin Van Durme, and Chris Kedzie. LLM-Rubric: A multidimensional, calibrated approach to automated evaluation of natural language texts. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 13806–13834, 2024

  14. [14]

    CLIPScore: A Reference-Free Evaluation Metric for Image Captioning

    Jack Hessel, Ari Holtzman, Maxwell Forbes, Ronan Le Bras, and Yejin Choi. CLIPScore: A reference-free evaluation metric for image captioning, 2022

  15. [15]

    ELLA: Equip Diffusion Models with LLM for Enhanced Semantic Alignment

    Xiwei Hu, Rui Wang, Yixiao Fang, Bin Fu, Pei Cheng, and Gang Yu. ELLA: Equip diffusion models with LLM for enhanced semantic alignment. arXiv preprint arXiv:2403.05135, 2024

  16. [16]

    Multimodal RewardBench 2: Evaluating Omni Reward Models for Interleaved Text and Image

    Yushi Hu, Reyhane Askari-Hemmat, Melissa Hall, Emily Dinan, Luke Zettlemoyer, and Marjan Ghazvininejad. Multimodal rewardbench 2: Evaluating omni reward models for interleaved text and image. arXiv preprint arXiv:2512.16899, 2025

  17. [17]

    AutoRubric: Rubric-Based Generative Rewards for Faithful Multimodal Reasoning

    Mengzhao Jia, Zhihan Zhang, Ignacio Cases, Zheyuan Liu, Meng Jiang, and Peng Qi. AutoRubric: Rubric-based generative rewards for faithful multimodal reasoning, 2026

  18. [18]

    Prometheus: Inducing Fine-Grained Evaluation Capability in Language Models

    Seungone Kim, Jamin Shin, Yejin Cho, Joel Jang, Shayne Longpre, Hwaran Lee, Sangdoo Yun, Seongjin Shin, Sungdong Kim, James Thorne, et al. Prometheus: Inducing fine-grained evaluation capability in language models. In The Twelfth International Conference on Learning Representations, 2023

  19. [19]

    Pick-a-Pic: An Open Dataset of User Preferences for Text-to-Image Generation

    Yuval Kirstain, Adam Polyak, Uriel Singer, Shahbuland Matiana, Joe Penna, and Omer Levy. Pick-a-Pic: An open dataset of user preferences for text-to-image generation. Advances in Neural Information Processing Systems, 36:36652–36663, 2023

  20. [20]

    FLUX.1 Kontext: Flow Matching for In-Context Image Generation and Editing in Latent Space

    Black Forest Labs, Stephen Batifol, Andreas Blattmann, Frederic Boesel, Saksham Consul, Cyril Diagne, Tim Dockhorn, Jack English, Zion English, Patrick Esser, et al. FLUX.1 Kontext: Flow matching for in-context image generation and editing in latent space. arXiv preprint arXiv:2506.15742, 2025

  21. [21]

    Holistic Evaluation of Text-to-Image Models

    Tony Lee, Michihiro Yasunaga, Chenlin Meng, Yifan Mai, Joon Sung Park, Agrim Gupta, Yunzhi Zhang, Deepak Narayanan, Hannah Teufel, Marco Bellagente, et al. Holistic evaluation of text-to-image models. Advances in Neural Information Processing Systems, 36:69981–70011, 2023

  22. [22]

    HP-Edit: A Human-Preference Post-Training Framework for Image Editing

    Fan Li, Chonghuinan Wang, Lina Lei, Yuping Qiu, Jiaqi Xu, Jiaxiu Jiang, Xinran Qin, Zhikai Chen, Fenglong Song, Zhixin Wang, Renjing Pei, and Wangmeng Zuo. HP-Edit: A human-preference post-training framework for image editing, 2026

  23. [23]

    UniWorld-V2: Reinforce Image Editing with Diffusion Negative-Aware Finetuning and MLLM Implicit Feedback

    Zongjian Li, Zheyuan Liu, Qihui Zhang, Bin Lin, Feize Wu, Shenghai Yuan, Zhiyuan Yan, Yang Ye, Wangbo Yu, Yuwei Niu, et al. UniWorld-V2: Reinforce image editing with diffusion negative-aware finetuning and MLLM implicit feedback. arXiv preprint arXiv:2510.16888, 2025

  24. [24]

    Step1X-Edit: A Practical Framework for General Image Editing

    Shiyu Liu, Yucheng Han, Peng Xing, Fukun Yin, Rui Wang, Wei Cheng, Jiaqi Liao, Yingming Wang, Honghao Fu, Chunrui Han, et al. Step1X-Edit: A practical framework for general image editing. arXiv preprint arXiv:2504.17761, 2025

  25. [25]

    Examining Reasoning LLMs-as-Judges in Non-Verifiable LLM Post-Training

    Yixin Liu, Yue Yu, DiJia Su, Sid Wang, Xuewei Wang, Song Jiang, Bo Liu, Arman Cohan, Yuandong Tian, and Zhengxing Chen. Examining reasoning LLMs-as-judges in non-verifiable LLM post-training. arXiv preprint arXiv:2603.12246, 2026

  26. [26]

    EditScore: Unlocking Online RL for Image Editing via High-Fidelity Reward Modeling

    Xin Luo, Jiahao Wang, Chenyuan Wu, Shitao Xiao, Xiyan Jiang, Defu Lian, Jiajun Zhang, Dong Liu, et al. EditScore: Unlocking online RL for image editing via high-fidelity reward modeling. arXiv preprint arXiv:2509.23909, 2025

  27. [27]

    JanusFlow: Harmonizing Autoregression and Rectified Flow for Unified Multimodal Understanding and Generation

    Yiyang Ma, Xingchao Liu, Xiaokang Chen, Wen Liu, Chengyue Wu, Zhiyu Wu, Zizheng Pan, Zhenda Xie, Haowei Zhang, Xingkai Yu, et al. JanusFlow: Harmonizing autoregression and rectified flow for unified multimodal understanding and generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7739–7751, 2025

  28. [28]

    HPSv3: Towards Wide-Spectrum Human Preference Score

    Yuhang Ma, Xiaoshi Wu, Keqiang Sun, and Hongsheng Li. HPSv3: Towards wide-spectrum human preference score. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 15086–15095, 2025

  29. [29]

    RubricEval: A Rubric-Level Meta-Evaluation Benchmark for LLM Judges in Instruction Following

    Tianjun Pan, Xuan Lin, Wenyan Yang, Qianyu He, Shisong Chen, Licai Qi, Wanqing Xu, Hongwei Feng, Bo Xu, and Yanghua Xiao. RubricEval: A rubric-level meta-evaluation benchmark for LLM judges in instruction following. arXiv preprint arXiv:2603.25133, 2026

  30. [30]

    Rubric Is All You Need: Improving LLM-Based Code Evaluation with Question-Specific Rubrics

    Aditya Pathak, Rachit Gandhi, Vaibhav Uttam, Arnav Ramamoorthy, Pratyush Ghosh, Aaryan Raj Jindal, Shreyash Verma, Aditya Mittal, Aashna Ased, Chirag Khatri, et al. Rubric is all you need: Improving LLM-based code evaluation with question-specific rubrics. In Proceedings of the 2025 ACM Conference on International Computing Education Research V.1, page...

  31. [31]

    SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis

    Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. SDXL: Improving latent diffusion models for high-resolution image synthesis, 2023

  32. [32]

    Emu Edit: Precise Image Editing via Recognition and Generation Tasks

    Shelly Sheynin, Adam Polyak, Uriel Singer, Yuval Kirstain, Amit Zohar, Oron Ashual, Devi Parikh, and Yaniv Taigman. Emu Edit: Precise image editing via recognition and generation tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8871–8879, 2024

  33. [33]

    OpenAI GPT-5 System Card

    Aaditya Singh, Adam Fry, Adam Perelman, Adam Tart, Adi Ganesh, Ahmed El-Kishky, Aidan McLaughlin, Aiden Low, AJ Ostrow, Akhila Ananthram, et al. OpenAI GPT-5 system card. arXiv preprint arXiv:2601.03267, 2025

  34. [34]

    Diffusion model alignment using direct preference optimization, 2023

    Bram Wallace, Meihua Dang, Rafael Rafailov, Linqi Zhou, Aaron Lou, Senthil Purushwalkam, Stefano Ermon, Caiming Xiong, Shafiq Joty, and Nikhil Naik. Diffusion model alignment using direct preference optimization, 2023

  35. [35]

    Large language models are not fair evaluators

    Peiyi Wang, Lei Li, Liang Chen, Zefan Cai, Dawei Zhu, Binghuai Lin, Yunbo Cao, Lingpeng Kong, Qi Liu, Tianyu Liu, et al. Large language models are not fair evaluators. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 9440–9450, 2024

  36. [36]

    Emu3: Next-Token Prediction is All You Need

    Xinlong Wang, Xiaosong Zhang, Zhengxiong Luo, Quan Sun, Yufeng Cui, Jinsheng Wang, Fan Zhang, Yueze Wang, Zhen Li, Qiying Yu, et al. Emu3: Next-token prediction is all you need. arXiv preprint arXiv:2409.18869, 2024

  37. [37]

    UniGenBench++: A Unified Semantic Evaluation Benchmark for Text-to-Image Generation

    Yibin Wang, Zhimin Li, Yuhang Zang, Jiazi Bu, Yujie Zhou, Yi Xin, Junjun He, Chunyu Wang, Qinglin Lu, Cheng Jin, and Jiaqi Wang. UniGenBench++: A unified semantic evaluation benchmark for text-to-image generation, 2026

  38. [38]

    Unified multimodal chain-of-thought reward model through reinforcement fine-tuning, 2025

    Yibin Wang, Zhimin Li, Yuhang Zang, Chunyu Wang, Qinglin Lu, Cheng Jin, and Jiaqi Wang. Unified multimodal chain-of-thought reward model through reinforcement fine-tuning, 2025

  39. [39]

    Unified Reward Model for Multimodal Understanding and Generation

    Yibin Wang, Yuhang Zang, Hao Li, Cheng Jin, and Jiaqi Wang. Unified reward model for multimodal understanding and generation. arXiv preprint arXiv:2503.05236, 2025

  40. [40]

    TIIF-Bench: How Does Your T2I Model Follow Your Instructions?

    Xinyu Wei, Jinrui Zhang, Zeqing Wang, Hongyang Wei, Zhen Guo, and Lei Zhang. TIIF-Bench: How does your T2I model follow your instructions?, 2025

  41. [41]

    Qwen-Image Technical Report

    Chenfei Wu, Jiahao Li, Jingren Zhou, Junyang Lin, Kaiyuan Gao, Kun Yan, Sheng-ming Yin, Shuai Bai, Xiao Xu, Yilei Chen, et al. Qwen-Image technical report. arXiv preprint arXiv:2508.02324, 2025

  42. [42]

    OmniGen2: Towards Instruction-Aligned Multimodal Generation

    Chenyuan Wu, Pengfei Zheng, Ruiran Yan, Shitao Xiao, Xin Luo, Yueze Wang, Wanli Li, Xiyan Jiang, Yexin Liu, Junjie Zhou, et al. OmniGen2: Exploration to advanced multimodal generation. arXiv preprint arXiv:2506.18871, 2025

  43. [43]

    EditReward: A Human-Aligned Reward Model for Instruction-Guided Image Editing

    Keming Wu, Sicong Jiang, Max Ku, Ping Nie, Minghao Liu, and Wenhu Chen. EditReward: A human-aligned reward model for instruction-guided image editing. arXiv preprint arXiv:2509.26346, 2025

  44. [44]

    Human Preference Score v2: A Solid Benchmark for Evaluating Human Preferences of Text-to-Image Synthesis

    Xiaoshi Wu, Yiming Hao, Keqiang Sun, Yixiong Chen, Feng Zhu, Rui Zhao, and Hongsheng Li. Human preference score v2: A solid benchmark for evaluating human preferences of text-to-image synthesis. arXiv preprint arXiv:2306.09341, 2023

  45. [45]

    Show-o2: Improved Native Unified Multimodal Models

    Jinheng Xie, Zhenheng Yang, and Mike Zheng Shou. Show-o2: Improved native unified multimodal models. arXiv preprint arXiv:2506.15564, 2025

  46. [46]

    Auto-Rubric: Learning from Implicit Weights to Explicit Rubrics for Reward Modeling

    Lipeng Xie, Sen Huang, Zhuo Zhang, Anni Zou, Yunpeng Zhai, Dingchao Ren, Kezun Zhang, Haoyuan Hu, Boyin Liu, Haoran Chen, et al. Auto-rubric: Learning from implicit weights to explicit rubrics for reward modeling. arXiv preprint arXiv:2510.17314, 2025

  47. [47]

    ImageReward: Learning and Evaluating Human Preferences for Text-to-Image Generation

    Jiazheng Xu, Xiao Liu, Yuchen Wu, Yuxuan Tong, Qinkai Li, Ming Ding, Jie Tang, and Yuxiao Dong. ImageReward: Learning and evaluating human preferences for text-to-image generation. Advances in Neural Information Processing Systems, 36:15903–15935, 2023

  48. [48]

    FLASK: Fine-Grained Language Model Evaluation Based on Alignment Skill Sets

    Seonghyeon Ye, Doyoung Kim, Sungdong Kim, Hyeonbin Hwang, Seungone Kim, Yongrae Jo, James Thorne, Juho Kim, and Minjoon Seo. FLASK: Fine-grained language model evaluation based on alignment skill sets. arXiv preprint arXiv:2307.10928, 2023

  49. [49]

    ImgEdit: A Unified Image Editing Dataset and Benchmark

    Yang Ye, Xianyi He, Zongjian Li, Bin Lin, Shenghai Yuan, Zhiyuan Yan, Bohan Hou, and Li Yuan. ImgEdit: A unified image editing dataset and benchmark, 2025

  50. [50]

    AnyEdit: Mastering Unified High-Quality Image Editing for Any Idea

    Qifan Yu, Wei Chow, Zhongqi Yue, Kaihang Pan, Yang Wu, Xiaoyang Wan, Juncheng Li, Siliang Tang, Hanwang Zhang, and Yueting Zhuang. AnyEdit: Mastering unified high-quality image editing for any idea, 2025

  51. [51]

    MagicBrush: A Manually Annotated Dataset for Instruction-Guided Image Editing

    Kai Zhang, Lingbo Mo, Wenhu Chen, Huan Sun, and Yu Su. MagicBrush: A manually annotated dataset for instruction-guided image editing. Advances in Neural Information Processing Systems, 36:31428–31449, 2023

  52. [52]

    Trust Your Critic: Robust Reward Modeling and Reinforcement Learning for Faithful Image Editing and Generation

    Xiangyu Zhao, Peiyuan Zhang, Junming Lin, Tianhao Liang, Yuchen Duan, Shengyuan Ding, Changyao Tian, Yuhang Zang, Junchi Yan, and Xue Yang. Trust your critic: Robust reward modeling and reinforcement learning for faithful image editing and generation. arXiv preprint arXiv:2603.12247, 2026

  53. [53]

    DiffusionNFT: Online Diffusion Reinforcement with Forward Process

    Kaiwen Zheng, Huayu Chen, Haotian Ye, Haoxiang Wang, Qinsheng Zhang, Kai Jiang, Hang Su, Stefano Ermon, Jun Zhu, and Ming-Yu Liu. DiffusionNFT: Online diffusion reinforcement with forward process. arXiv preprint arXiv:2509.16117, 2025

  54. [54]

    Zero-shot rubric generation: ARR synthesizes rubrics on-the-fly from frozen VLMs, enabling immediate deployment in new domains without additional data collection or task-specific supervision

  55. [55]

    This preserves inter-criterion dependencies and avoids inconsistencies introduced by independent scoring and aggregation

    Holistic, rubric-conditioned decision interface: Rather than aggregating independently scored criteria post hoc, ARR formulates evaluation as a single rubric-conditioned judgment, where all dimensions are jointly considered in a pairwise comparison. This preserves inter-criterion dependencies and avoids inconsistencies introduced by independent scoring a...

  56. [56]

    Training-free reward interface: ARR operates without any parameter updates to the judge model, eliminating the computational and data overhead associated with training pointwise or pairwise reward models, while retaining strong generalization through the underlying VLM

  57. [57]

    rank": [rank_of_Image1, rank_of_Image2],

    Data-efficient rubric induction: Across all experiments, high-quality rubrics are con- structed from as few as 100 preference pairs drawn from ShareGPT-4o-Image. This demon- strates that ARR can recover structured, task-relevant evaluation criteria with minimal supervision, achieving competitive performance with substantially lower data requirements than ...