pith. machine review for the scientific record.

arxiv: 2605.07477 · v1 · submitted 2026-05-08 · 💻 cs.CV

Recognition: 2 theorem links

· Lean Theorem

ReasonEdit: Towards Interpretable Image Editing Evaluation via Reinforcement Learning

Authors on Pith: no claims yet

Pith reviewed 2026-05-11 01:47 UTC · model grok-4.3

classification 💻 cs.CV
keywords image editing evaluation · interpretable reasoning · reinforcement learning · chain-of-thought · multimodal large language models · human preference alignment · text-guided image editing

The pith

ReasonEdit trains an interpretable evaluator for text-guided image editing using reinforcement learning on human judgments of reasoning quality.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Existing evaluation of text-guided image edits relies on single numeric scores that reveal little about specific failures or successes. This paper creates ReasonEdit-22K, a dataset of 22K edited images paired with 113K chain-of-thought explanations and 1.3 million human ratings on the logicality, accuracy, and usefulness of those explanations. It then builds RE-Reward, a multimodal model that scores the quality of such reasoning, and uses the Group Relative Policy Optimization algorithm to train ReasonEdit as an evaluator that outputs both a judgment and readable supporting text. If the approach holds, evaluation shifts from opaque numbers to transparent explanations that developers and users can inspect and trust. This matters because clearer feedback can accelerate improvement of image editing systems by highlighting exactly where edits go wrong.

Core claim

ReasonEdit is an evaluation model for text-guided image editing trained with the Group Relative Policy Optimization algorithm on reward signals from RE-Reward. RE-Reward is a multimodal large language model that scores chain-of-thought interpretations according to human ratings of logicality, accuracy, and usefulness. The training data comes from the ReasonEdit-22K dataset of 22K edited images, 113K chain-of-thought samples, and 1.3M human judgments. On public benchmarks the resulting model aligns more closely with human preferences than prior scalar methods while also generating high-quality interpretable evaluation text.

What carries the argument

RE-Reward, an MLLM-based model that scores chain-of-thought reasoning chains on logicality, accuracy, and usefulness, whose signals then train ReasonEdit via the Group Relative Policy Optimization algorithm.
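The GRPO machinery named above can be illustrated by its core step, group-relative advantage normalization. This is a minimal generic sketch, not the paper's implementation; the reward values are hypothetical stand-ins for RE-Reward scores on a group of sampled evaluation texts.

```python
import statistics

def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantages as used in GRPO: each sampled response's
    reward is normalized by its group's mean and standard deviation."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# One editing case, G = 4 candidate evaluation texts; the rewards are
# hypothetical stand-ins for RE-Reward scores.
group_rewards = [0.8, 0.5, 0.9, 0.2]
advs = grpo_advantages(group_rewards)
# Texts scored above the group mean get positive advantage (reinforced);
# those below get negative advantage (suppressed).
```

Because advantages are computed within a group rather than against a learned value baseline, no critic network is needed, which is the practical appeal of GRPO for reward-model-driven training.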

If this is right

  • Evaluation of image edits gains readable supporting text instead of single scores.
  • The trained model generalizes to multiple public benchmarks while preserving human alignment.
  • Developers obtain specific, inspectable feedback on artifacts and unintended changes.
  • Assessment of text-guided editing becomes more transparent and therefore more actionable.
  • Future editing models can be iterated using the same interpretable signals.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same dataset-plus-reward-plus-GRPO pipeline could be applied to evaluation of other generative tasks such as video or 3D editing.
  • The generated reasoning text might be fed back into editing models to guide iterative refinement without additional human labels.
  • Systematic patterns in the model's explanations could expose recurring weaknesses in current text-guided editing techniques.

Load-bearing premise

The 1.3 million human judgments on logicality, accuracy, and usefulness collected for the ReasonEdit-22K dataset form a reliable and unbiased training signal that produces generalizable evaluators without circular dependence on the same ratings.
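One standard way to probe whether such human judgments form a reliable signal is inter-annotator agreement. Below is a minimal sketch using Cohen's kappa for two annotators on hypothetical 1–5 ratings; the paper does not specify which agreement statistic, if any, it reports.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two annotators: observed agreement corrected
    for the agreement expected by chance from each rater's marginals."""
    n = len(rater_a)
    p_obs = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    ca, cb = Counter(rater_a), Counter(rater_b)
    p_exp = sum(ca[k] * cb[k] for k in ca) / (n * n)
    return (p_obs - p_exp) / (1 - p_exp)

# Hypothetical 1-5 logicality ratings from two annotators on six CoT texts.
kappa = cohens_kappa([5, 4, 4, 3, 5, 2], [5, 4, 3, 3, 5, 2])
```

A kappa well above chance across the three rated dimensions would support the premise; near-zero agreement would undercut the entire pipeline.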

What would settle it

Collect fresh human ratings on the logicality, accuracy, and usefulness of evaluation text produced by ReasonEdit versus baseline methods on a large set of edited images never seen during training; if ReasonEdit text receives lower average ratings, the central claim fails.
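The proposed settling experiment amounts to a paired comparison of fresh ratings. A minimal sketch under assumed data: one per-image rating (averaged across logicality, accuracy, and usefulness) for each system on the same held-out images; all numbers are illustrative.

```python
import statistics

def compare_ratings(reasonedit_scores, baseline_scores):
    """Paired comparison of fresh human ratings (one averaged
    logicality/accuracy/usefulness score per held-out image)."""
    assert len(reasonedit_scores) == len(baseline_scores)
    diffs = [a - b for a, b in zip(reasonedit_scores, baseline_scores)]
    return {
        "mean_diff": statistics.mean(diffs),
        "wins": sum(d > 0 for d in diffs),
        "n": len(diffs),
    }

# Illustrative 1-5 ratings on five edited images never seen in training.
result = compare_ratings([4.2, 3.8, 4.5, 3.9, 4.1],
                         [3.7, 3.9, 4.0, 3.5, 3.8])
# A negative mean_diff on a large fresh sample would falsify the claim.
```

On a real sample the mean difference would be accompanied by a paired significance test; the sketch keeps only the decision-relevant quantities.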

Figures

Figures reproduced from arXiv: 2605.07477 by Guangtao Zhai, Honghua Chen, Huiyu Duan, Xinyun Zhang, Xiongkuo Min, Zitong Xu.

Figure 1
Figure 1. Dataflow during the construction of ReasonEdit-22K and its two subsets [PITH_FULL_IMAGE:figures/full_fig_p003_1.png] view at source ↗
Figure 2
Figure 2. Annotation workflow strategies to guide the MLLMs to generate structured CoT reasoning alongside three-dimensional sub-scores on visual quality, instruction alignment, and content preservation, as well as an overall quality score. view at source ↗
Figure 3
Figure 3. Overview of the RE-Reward architecture and SFT training [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗
Figure 4
Figure 4. Overview of the ReasonEdit architecture and SFT training [PITH_FULL_IMAGE:figures/full_fig_p006_4.png] view at source ↗
Figure 5
Figure 5. Task-type distribution in ReasonEdit-22K [PITH_FULL_IMAGE:figures/full_fig_p015_5.png] view at source ↗
Figure 6
Figure 6. Annotation interface for scoring candidate critiques [PITH_FULL_IMAGE:figures/full_fig_p016_6.png] view at source ↗
Figure 7
Figure 7. The raw 1–5 ordinal scores after the annotation of ReasonEdit-Reward-113K [PITH_FULL_IMAGE:figures/full_fig_p017_7.png] view at source ↗
Figure 8
Figure 8. Example where interpretable reasoning reveals a failure not captured by a single scalar [PITH_FULL_IMAGE:figures/full_fig_p024_8.png] view at source ↗
read the original abstract

Recent text-guided image editing (TIE) models have achieved remarkable progress, however, many edited results still suffer from artifacts, unintended modifications, and suboptimal aesthetics. Although several benchmarks and evaluation methods have been proposed, most existing approaches rely on scalar scores and lack interpretability. This limitation largely stems from the absence of high-quality interpretation datasets for TIE and effective reward models to train interpretable evaluators. To address these challenges, we introduce ReasonEdit-22K, the first dataset that combines 22K edited images with 113K Chain-of-Thought (CoT) samples, along with 1.3M human judgments assessing these interpretations in terms of logicality, accuracy, and usefulness. Building upon this dataset, we propose RE-Reward, a multimodal large language model (MLLM)-based reward model designed to provide human-aligned feedback for evaluating interpretable reasoning in image editing. Furthermore, we develop ReasonEdit, which is trained using reward signals derived from RE-Reward and the Group Relative Policy Optimization (GRPO) algorithm to learn an interpretable evaluation model. Extensive experiments demonstrate that ReasonEdit achieves superior alignment with human preferences and exhibits strong generalization across public benchmarks. In addition, it is capable of generating high-quality interpretable evaluation text, enabling more transparent and trustworthy assessment for image editing. The code is available at https://github.com/IntMeGroup/ReasonEdit.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces ReasonEdit-22K, a dataset of 22K edited images paired with 113K Chain-of-Thought samples and 1.3M human judgments on logicality, accuracy, and usefulness. It proposes RE-Reward, an MLLM-based reward model trained on these judgments, and ReasonEdit, an interpretable evaluator trained via Group Relative Policy Optimization (GRPO) using reward signals from RE-Reward. The central claim is that ReasonEdit achieves superior alignment with human preferences, strong generalization on public benchmarks, and generates high-quality interpretable evaluation text for text-guided image editing.

Significance. If the empirical claims hold after addressing the training-loop concerns, the work would supply a new human-annotated resource and a reward-modeling pipeline for producing interpretable rather than scalar evaluations in image editing. The dataset size and the use of GRPO for policy optimization are concrete contributions that could be reused by the community, but the absence of reported metrics, baselines, or ablation results in the provided abstract limits any assessment of practical impact.

major comments (2)
  1. [Abstract] Abstract: the claim of 'superior alignment with human preferences and strong generalization across public benchmarks' is stated without any quantitative metrics, baseline comparisons, ablation studies, or numerical results. This omission makes it impossible to determine whether the data support the central claim.
  2. [Abstract and implied methodology] Dataset construction and training loop (implied in Abstract): RE-Reward is fit directly to the 1.3M human judgments collected on ReasonEdit-22K CoT samples, after which the same ReasonEdit-22K data is used to generate RE-Reward scores for GRPO training of ReasonEdit. No train/test split for the reward model, inter-annotator agreement statistics, or independent validation set is mentioned, raising a load-bearing risk that reported gains reflect reward hacking or data leakage rather than genuine interpretability.
minor comments (1)
  1. [Abstract] The abstract refers to 'extensive experiments' yet supplies no summary statistics or figures; a one-sentence results highlight would improve readability.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and have prepared revisions to strengthen the manuscript.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the claim of 'superior alignment with human preferences and strong generalization across public benchmarks' is stated without any quantitative metrics, baseline comparisons, ablation studies, or numerical results. This omission makes it impossible to determine whether the data support the central claim.

    Authors: We agree that the abstract would be strengthened by including key quantitative results. In the revised manuscript we will add specific metrics (e.g., human alignment correlations and benchmark generalization scores) together with brief baseline comparisons drawn from the experiments section. This change will make the central claims directly verifiable from the abstract. revision: yes

  2. Referee: [Abstract and implied methodology] Dataset construction and training loop (implied in Abstract): RE-Reward is fit directly to the 1.3M human judgments collected on ReasonEdit-22K CoT samples, after which the same ReasonEdit-22K data is used to generate RE-Reward scores for GRPO training of ReasonEdit. No train/test split for the reward model, inter-annotator agreement statistics, or independent validation set is mentioned, raising a load-bearing risk that reported gains reflect reward hacking or data leakage rather than genuine interpretability.

    Authors: We acknowledge this methodological concern. We will revise the manuscript to explicitly document the data partitioning: RE-Reward was trained on a held-out subset of the 1.3M judgments with a separate validation split; inter-annotator agreement statistics will be reported; and the CoT samples used for GRPO were drawn from a disjoint portion of ReasonEdit-22K. These clarifications will demonstrate that the training procedure avoids leakage and that performance gains reflect genuine alignment. revision: yes
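The partitioning the rebuttal describes can be enforced mechanically by splitting at the image level, so that every judgment tied to one edited image lands in exactly one role. A generic leakage-guard sketch, not the authors' documented protocol; split names and fractions are assumptions.

```python
import hashlib

def assign_split(image_id: str, frac_reward=0.6, frac_val=0.1):
    """Deterministic, disjoint partitioning by image ID: all CoT samples
    and human judgments for one edited image fall into exactly one of
    reward-model training, reward-model validation, or GRPO rollouts."""
    # Hash the ID into [0, 1) so the assignment is stable across runs.
    h = int(hashlib.sha256(image_id.encode()).hexdigest(), 16) % 1000 / 1000
    if h < frac_reward:
        return "reward_train"
    if h < frac_reward + frac_val:
        return "reward_val"
    return "grpo"

splits = {assign_split(f"img_{i:05d}") for i in range(1000)}
# Grouping by image prevents the same edit from informing both the
# reward model and the policy it later scores.
```

Splitting by image rather than by CoT sample is the key design choice: multiple critiques of the same edit are highly correlated, so a sample-level split would still leak.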

Circularity Check

0 steps flagged

No significant circularity; derivation relies on independent human annotations and external benchmarks

full rationale

The paper constructs ReasonEdit-22K with 1.3M external human judgments on logicality/accuracy/usefulness as the primary training signal. RE-Reward is fit to these judgments, after which ReasonEdit is optimized via GRPO using RE-Reward scores. The central claims (human alignment and generalization) are evaluated on public benchmarks that lie outside the training distribution. No equation or claim reduces a reported prediction to a quantity defined by the same fitted values; no self-citation supplies a uniqueness theorem or ansatz; the human signal is treated as an independent oracle rather than a self-generated loop. This is a standard reward-model-plus-RL pipeline with external validation, hence free of self-referential circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The central claim rests on the quality and representativeness of the newly collected human judgments and the effectiveness of the RE-Reward + GRPO pipeline; limited information is available from the abstract alone.

axioms (1)
  • domain assumption Human judgments collected on logicality, accuracy, and usefulness of CoT samples provide a reliable training signal for reward models and downstream evaluators.
    The entire pipeline depends on the 1.3M human judgments described in the abstract.
invented entities (2)
  • RE-Reward no independent evidence
    purpose: MLLM-based reward model that scores interpretable reasoning in image edits
    New component introduced to generate human-aligned feedback.
  • ReasonEdit no independent evidence
    purpose: RL-trained model that produces interpretable evaluation text for image edits
    New model trained using GRPO and signals from RE-Reward.

pith-pipeline@v0.9.0 · 5564 in / 1401 out tokens · 59017 ms · 2026-05-11T01:47:39.678017+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

77 extracted references · 77 canonical work pages · 8 internal anchors

  1. [1]

    Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond

    Jinze Bai et al. Qwen-vl: A frontier large vision-language model with versatile abilities.arXiv preprint arXiv:2308.12966, 2023

  2. [2]

    Meteor: An automatic metric for mt evaluation with improved correlation with human judgments

    Satanjeev Banerjee and Alon Lavie. Meteor: An automatic metric for mt evaluation with improved correlation with human judgments. InACL Workshop, 2005

  3. [3]

    Instructpix2pix: Learning to follow image editing instructions

    Tim Brooks, Aleksander Holynski, and Alexei A Efros. Instructpix2pix: Learning to follow image editing instructions. InCVPR, 2023

  4. [4]

Topiq: A transformed-order prioritized image quality assessment

    B. Chen et al. Topiq: A transformed-order prioritized image quality assessment.arXiv preprint arXiv:2308.XXXXX, 2023

  5. [5]

    Internvl2: Better and faster vision-language understanding.arXiv preprint arXiv:2407.XXXXX, 2024

    Zhe Chen et al. Internvl2: Better and faster vision-language understanding.arXiv preprint arXiv:2407.XXXXX, 2024

  6. [6]

    Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

    Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities.arXiv preprint arXiv:2507.06261, 2025

  7. [7]

    Gemma: Open Models Based on Gemini Research and Technology

    Google DeepMind. Gemma: Open models based on gemini research and technology.arXiv preprint arXiv:2403.08295, 2024

  8. [8]

    Qlora: Efficient finetuning of quantized llms

    Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. Qlora: Efficient finetuning of quantized llms. InNeurIPS, 2023

  9. [9]

    Finevq: Fine-grained user generated content video quality assessment

    Huiyu Duan, Qiang Hu, Jiarui Wang, Liu Yang, Zitong Xu, Lu Liu, Xiongkuo Min, Chunlei Cai, Tianxiao Ye, Xiaoyun Zhang, and Guangtao Zhai. Finevq: Fine-grained user generated content video quality assessment. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025

  10. [10]

    Scaling rectified flow transformers for high-resolution image synthesis

    Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, et al. Scaling rectified flow transformers for high-resolution image synthesis. InProceedings of the International Conference on Machine Learning (ICML), 2024

  11. [11]

    Simcse: Simple contrastive learning of sentence embeddings

    Tianyu Gao, Xingcheng Yao, and Danqi Chen. Simcse: Simple contrastive learning of sentence embeddings. InEMNLP, 2021

  12. [12]

    Gemini 2.0: A next-generation multimodal model.Technical Report, 2024

    Google. Gemini 2.0: A next-generation multimodal model.Technical Report, 2024

  13. [13]

    Gemini 3.1 pro: Best for complex tasks and bringing creative concepts to life

    Google DeepMind. Gemini 3.1 pro: Best for complex tasks and bringing creative concepts to life. https://deepmind.google/models/gemini/pro/, 2025

  14. [14]

    UniREditBench: A unified reasoning-based image editing benchmark.arXiv preprint arXiv:2511.01295, 2025

    Feng Han, Yibin Wang, Chenglin Li, Zheming Liang, Dianyi Wang, Yang Jiao, Zhipeng Wei, Chao Gong, Cheng Jin, Jingjing Chen, et al. Unireditbench: A unified reasoning-based image editing benchmark.arXiv preprint arXiv:2511.01295, 2025

  15. [15]

    CLIPScore: A Reference-free Evaluation Metric for Image Captioning

    Jack Hessel et al. Clipscore: A reference-free evaluation metric for image captioning.arXiv preprint arXiv:2104.08718, 2021

  16. [16]

Instantstyle: Free lunch towards style-preserving in text-to-image generation. arXiv preprint arXiv:2404.02733, 2024

    Shengding Hu et al. Minicpm-v: A gpt-4v level mllm on your phone.arXiv preprint arXiv:2404.02733, 2024

  17. [17]

Genai-bench: A comprehensive benchmark for generative ai

Y. Jiang et al. Genai-bench: A comprehensive benchmark for generative ai. arXiv preprint arXiv:2406.XXXXX, 2024

  18. [18]

Pick-a-pic: An open dataset of user preferences for text-to-image generation

Y. Kirstain et al. Pick-a-pic: An open dataset of user preferences for text-to-image generation. NeurIPS, 2023

  19. [19]

Pick-a-pic: An open dataset of user preferences for text-to-image generation. NeurIPS, 2023

    Yuval Kirstain et al. Pick-a-pic: An open dataset of user preferences for text-to-image generation. NeurIPS, 2023

  20. [20]

Editreward: A human-aligned reward model for instruction-guided image editing

    Benno Krojer et al. Editreward: A human-aligned reward model for instruction-guided image editing. arXiv preprint arXiv:2509.26346, 2025

  21. [21]

    Learning action and reasoning-centric image editing from videos and simulation

Benno Krojer, Dheeraj Vattikonda, et al. Learning action and reasoning-centric image editing from videos and simulation. In NeurIPS, 2024

  22. [22]

    Fleur: An explainable reference-free evaluation metric for image captioning using a large multimodal model

    Yebin Lee, Imseong Park, and Myungjoo Kang. Fleur: An explainable reference-free evaluation metric for image captioning using a large multimodal model. InProceedings of the Association for Computational Linguistics (ACL), pages 3732–3746, 2024

  23. [23]

    Rouge: A package for automatic evaluation of summaries

    Chin-Yew Lin. Rouge: A package for automatic evaluation of summaries. InACL Workshop, 2004

  24. [24]

Vqa-score: Evaluating image-to-text generation via question answering

    Z. Lin et al. Vqa-score: Evaluating image-to-text generation via question answering.arXiv preprint arXiv:2403.XXXXX, 2024

  25. [25]

    Evaluating text-to-visual generation with image-to-text generation

    Z. Lin et al. Vqascore: Evaluating text-to-image generation with visual question answering.arXiv preprint arXiv:2404.01291, 2024

  26. [26]

    Unlocking the essence of beauty: Advanced aesthetic reasoning with relative-absolute policy optimization

    Boyang Liu, Yifan Hu, Senjie Jin, Shihan Dou, Gonglei Shi, Jie Shao, Tao Gui, and Xuanjing Huang. Unlocking the essence of beauty: Advanced aesthetic reasoning with relative-absolute policy optimization. arXiv preprint arXiv:2509.21871, 2025

  27. [27]

    Visual instruction tuning

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. InNeurIPS, 2023

  28. [28]

    Deepseek-vl2: Mixture-of-experts vision-language models.arXiv preprint arXiv:2412.XXXXX, 2024

    Haoyu Lu et al. Deepseek-vl2: Mixture-of-experts vision-language models.arXiv preprint arXiv:2412.XXXXX, 2024

  29. [29]

    Ovis2.5 technical report, 2025

    Shiyin Lu, Yang Li, Yu Xia, Yuwei Hu, Shanshan Zhao, Yanqing Ma, et al. Ovis2.5 technical report. arXiv:2508.11737, 2025

  30. [30]

Ahiq: Attentive human-centric image quality assessment

    S. Luo et al. Ahiq: Attentive human-centric image quality assessment.arXiv preprint arXiv:2305.XXXXX, 2023

  31. [31]

    Distributed representations of words and phrases and their compositionality

    Tomas Mikolov et al. Distributed representations of words and phrases and their compositionality. In NeurIPS, 2013

  32. [32]

    No-reference image quality assessment in the spatial domain.IEEE TIP, 21(12):4695–4708, 2012

    Anish Mittal, Anush Krishna Moorthy, and Alan C Bovik. No-reference image quality assessment in the spatial domain.IEEE TIP, 21(12):4695–4708, 2012

  33. [33]

Making a "completely blind" image quality analyzer

    Anish Mittal, Rajiv Soundararajan, and Alan C Bovik. Making a "completely blind" image quality analyzer. IEEE Signal Processing Letters, 20(3):209–212, 2012

  34. [34]

Interpretable reward models via decomposable attribution

    S. Mo et al. Interpretable reward models via decomposable attribution.arXiv preprint arXiv:2501.01234, 2025

  35. [35]

    Blind image quality assessment: From natural scene statistics to perceptual quality.IEEE TIP, 20(12):3350–3364, 2011

    Anush Krishna Moorthy and Alan C Bovik. Blind image quality assessment: From natural scene statistics to perceptual quality.IEEE TIP, 20(12):3350–3364, 2011

  36. [36]

    Hello gpt-4o.https://openai.com/index/hello-gpt-4o/, 2024

    OpenAI. Hello gpt-4o.https://openai.com/index/hello-gpt-4o/, 2024

  37. [37]

    Gpt-5 technical report.Technical Report, 2025

    OpenAI. Gpt-5 technical report.Technical Report, 2025

  38. [38]

    Gpt image 1: State-of-the-art image generation model, 2025

    OpenAI. Gpt image 1: State-of-the-art image generation model, 2025. https://platform.openai. com/docs/models/gpt-image-1

  39. [39]

    Internvl 3.5: Open-source vision-language model.arXiv preprint, 2025

    OpenGVLab. Internvl 3.5: Open-source vision-language model.arXiv preprint, 2025

  40. [40]

Imagenhub: Standardizing the evaluation of conditional image generation

    S. Peng et al. Imagenhub: Standardizing the evaluation of conditional image generation.arXiv preprint arXiv:2310.XXXXX, 2023

  41. [41]

    High-resolution image synthesis with latent diffusion models

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10684–10695, 2022

  42. [42]

    Positive-augmented contrastive learning for image and video captioning evaluation

    Sara Sarto, Manuele Barraco, Marcella Cornia, Lorenzo Baraldi, and Rita Cucchiara. Positive-augmented contrastive learning for image and video captioning evaluation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 6914–6924, 2023

  43. [43]

    Seedream 4.0: Toward next-generation multimodal image generation.arXiv preprint, 2025

    Team Seedream, Yunpeng Chen, Yu Gao, Lixue Gong, Meng Guo, Qiushan Guo, et al. Seedream 4.0: Toward next-generation multimodal image generation.arXiv preprint, 2025

  44. [44]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y. K. Li, Y. Wu, and Daya Guo. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024

  45. [45]

    Qwen3 technical report.arXiv preprint, 2025

Alibaba Qwen Team. Qwen3 technical report. arXiv preprint, 2025

  46. [46]

    Qwen3.5-omni technical report

    Qwen Team. Qwen3.5-omni technical report

  47. [47]

    Editscore: Unlocking online rl for image editing via high-fidelity reward modeling

    VectorSpaceLab. Editscore: Unlocking online rl for image editing via high-fidelity reward modeling. In ICLR, 2026

  48. [48]

    Creval: An automated interpretable evaluation for creative image manipulation under complex instructions.arXiv preprint arXiv:2603.26174, 2026

    Chonghuinan Wang, Zihan Chen, Yuxiang Wei, Tianyi Jiang, Xiaohe Wu, Fan Li, Wangmeng Zuo, and Hongxun Yao. Creval: An automated interpretable evaluation for creative image manipulation under complex instructions.arXiv preprint arXiv:2603.26174, 2026

  49. [49]

    Lmm4lmm: Benchmarking and evaluating large-multimodal image generation with lmms

    Jiarui Wang, Huiyu Duan, Yu Zhao, Juntong Wang, Guangtao Zhai, and Xiongkuo Min. Lmm4lmm: Benchmarking and evaluating large-multimodal image generation with lmms. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 17312–17323, 2025

  50. [50]

    Image quality assessment: from error visibility to structural similarity.IEEE TIP, 13(4):600–612, 2004

    Zhou Wang, Alan C Bovik, Hamid R Sheikh, and Eero P Simoncelli. Image quality assessment: from error visibility to structural similarity.IEEE TIP, 13(4):600–612, 2004

  51. [51]

    Qwen-Image Technical Report

    Chenfei Wu, Jiahao Li, Jingren Zhou, Junyang Lin, Kaiyuan Gao, Kun Yan, et al. Qwen-image technical report.arXiv preprint arXiv:2508.02324, 2025

  52. [52]

    OmniGen2: Towards Instruction-Aligned Multimodal Generation

    Chenyuan Wu, Pengfei Zheng, Ruiran Yan, Shitao Xiao, Xin Luo, Yueze Wang, et al. Omnigen2: Exploration to advanced multimodal generation.arXiv preprint arXiv:2506.18871, 2025

  53. [53]

    Visualquality-r1: Reasoning-induced image quality assessment via reinforcement learning to rank.arXiv preprint arXiv:2505.14460, 2025

    Tianhe Wu, Jian Zou, Jie Liang, Lei Zhang, and Kede Ma. VisualQuality-R1: Reasoning-induced image quality assessment via reinforcement learning to rank.arXiv preprint arXiv:2505.14460, 2025

  54. [54]

Dreamomni2: Multimodal instruction-based editing and generation

    Bin Xia, Bohao Peng, Yuechen Zhang, Junjia Huang, Jiyang Liu, Jingyao Li, et al. Dreamomni2: Multimodal instruction-based editing and generation.arXiv preprint arXiv:2510.06679, 2025

  55. [55]

    Imagereward: Learning and evaluating human preferences for text-to-image generation

Jiazheng Xu et al. Imagereward: Learning and evaluating human preferences for text-to-image generation. NeurIPS, 2023

  56. [56]

    Imagereward: Learning and evaluating human preferences for text-to-image generation

    Jiazheng Xu et al. Imagereward: Learning and evaluating human preferences for text-to-image generation. NeurIPS, 2023

  57. [57]

Edithf-1m: A million-scale rich human preference feedback for image editing. arXiv preprint arXiv:2603.14916, 2026

    Zitong Xu, Huiyu Duan, Zhongpeng Ji, Xinyun Zhang, Yutao Liu, Xiongkuo Min, et al. Edithf-1m: A million-scale rich human preference feedback for image editing.arXiv preprint arXiv:2603.14916, 2026

  58. [58]

    Harmonyiqa: Pioneering benchmark and model for image harmonization quality assessment

    Zitong Xu, Huiyu Duan, Guangji Ma, Liu Yang, Jiarui Wang, Qingbo Wu, et al. Harmonyiqa: Pioneering benchmark and model for image harmonization quality assessment. InIEEE International Conference on Multimedia and Expo (ICME), pages 1–6, 2025

  59. [59]

    Lmm4edit: Benchmarking and evaluating multimodal image editing with lmms.arXiv preprint arXiv:2507.16193, 2025

    Zitong Xu et al. Lmm4edit: Benchmarking and evaluating multimodal image editing with lmms.arXiv preprint arXiv:2507.16193, 2025

  60. [60]

    Gradient magnitude similarity deviation: A highly efficient perceptual image quality index.IEEE TIP, 23(2):684–695, 2013

    Wufeng Xue, Lei Zhang, Xuanqin Mou, and Alan C Bovik. Gradient magnitude similarity deviation: A highly efficient perceptual image quality index.IEEE TIP, 23(2):684–695, 2013

  61. [61]

    Maniqa: Multi-dimension attention network for no-reference image quality assessment

    Pengfei Yang et al. Maniqa: Multi-dimension attention network for no-reference image quality assessment. InCVPR Workshops, 2022

  62. [62]

    Image quality assessment based on the perceived structural similarity index of an image.Mathematical Biosciences and Engineering, 20(5):9385–9409, 2023

    Juncai Yao, Jing Shen, and Congying Yao. Image quality assessment based on the perceived structural similarity index of an image.Mathematical Biosciences and Engineering, 20(5):9385–9409, 2023

  63. [63]

    mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality

Qinghao Ye et al. mplug-owl: Modularization empowers large language models with multimodality. arXiv preprint arXiv:2304.14178, 2023

  64. [64]

    Content-variant reference image quality assessment via knowledge distillation

    Guanghao Yin, Wei Wang, Zehuan Yuan, et al. Content-variant reference image quality assessment via knowledge distillation. InAAAI, volume 36, pages 3134–3142, 2022

  65. [65]

    Magicbrush: A large-scale dataset for instruction-guided real image editing.NeurIPS, 2024

    Kai Zhang et al. Magicbrush: A large-scale dataset for instruction-guided real image editing.NeurIPS, 2024

  66. [66]

    LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention

    Renrui Zhang et al. Llama-adapter: Efficient fine-tuning of language models with zero-init attention.arXiv preprint arXiv:2303.16199, 2023

  67. [67]

    The unreasonable effectiveness of deep features as a perceptual metric

    Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. InCVPR, pages 586–595, 2018

  68. [68]

Q-align: Teaching lmms for visual scoring via language-to-score alignment

Wu Zhang et al. Q-align: Teaching lmms for visual scoring via language-to-score alignment. arXiv preprint arXiv:2312.17090, 2023

  69. [69]

    Critique-llm: Scaling feedback generation for large language models.arXiv preprint arXiv:2405.00123, 2024

    Chujie Zheng et al. Critique-llm: Scaling feedback generation for large language models.arXiv preprint arXiv:2405.00123, 2024

  70. [70]

    InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models

Jinguo Zhu, Weiyun Wang, Zhe Chen, Zhaoyang Liu, Shenglong Ye, Lixin Gu, et al. Internvl3: Exploring advanced training and test-time recipes for open-source multimodal models. arXiv preprint arXiv:2504.10479, 2025

  71. [73]

    Reward Scores

    Content Preservation (Content consistency) E.g., consistency of the main structure with the original, preservation of unedited areas, style consistency. [Final Assessment] After outputting [Final Assessment], immediately continue with exactly three scores for Vi- sual Quality, Editing Alignment, and Content Preservation in one line, separated by commas, w...

  72. [74]

    logicality: internal consistency, coherent reasoning, and absence of contradictions

  73. [75]

    accuracy: factual alignment with the source image, edited image, and editing instruction

  74. [76]

    up_proj",

    usefulness: specificity, diagnostic value, and usefulness for reward modeling. Summarize the grounded evidence into the final anchor token sequence for regression. Reward Scores: D Details of ReasonEdit D.1 Dual-head model architecture ReasonEdit is a multimodal generator-regressor for interpretable TIE evaluation. It takes the source image, edited image,...

  75. [77]

    Visual Quality (Naturalness of the edit and image) E.g., lighting, clarity, color, details, realism, etc

  76. [78]

    Editing Alignment (Adherence to editing instructions) Whether the instruction is fully or partially implemented, and the effectiveness of the imple- mentation

  77. [79]

    logicality

    Content Preservation (Content consistency) E.g., consistency of the main structure with the original, preservation of unedited areas, style consistency. [Final Assessment] After outputting [Final Assessment], immediately continue with exactly three scores for Vi- sual Quality, Editing Alignment, and Content Preservation in one line, separated by commas, w...