pith. machine review for the scientific record.

arxiv: 2605.07477 · v1 · submitted 2026-05-08 · 💻 cs.CV

Recognition: 2 theorem links

· Lean Theorem

ReasonEdit: Towards Interpretable Image Editing Evaluation via Reinforcement Learning

Authors on Pith: no claims yet

Pith reviewed 2026-05-11 01:47 UTC · model grok-4.3

classification 💻 cs.CV
keywords image editing evaluation · interpretable reasoning · reinforcement learning · chain-of-thought · multimodal large language models · human preference alignment · text-guided image editing

The pith

ReasonEdit trains an interpretable evaluator for text-guided image editing using reinforcement learning on human judgments of reasoning quality.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Existing evaluation of text-guided image edits relies on single numeric scores that reveal little about specific failures or successes. This paper creates ReasonEdit-22K, a dataset of 22K edited images paired with 113K chain-of-thought explanations and 1.3 million human ratings on the logicality, accuracy, and usefulness of those explanations. It then builds RE-Reward, a multimodal model that scores the quality of such reasoning, and uses the Group Relative Policy Optimization algorithm to train ReasonEdit as an evaluator that outputs both a judgment and readable supporting text. If the approach holds, evaluation shifts from opaque numbers to transparent explanations that developers and users can inspect and trust. This matters because clearer feedback can accelerate improvement of image editing systems by highlighting exactly where edits go wrong.

Core claim

ReasonEdit is an evaluation model for text-guided image editing trained with the Group Relative Policy Optimization algorithm on reward signals from RE-Reward. RE-Reward is a multimodal large language model that scores chain-of-thought interpretations according to human ratings of logicality, accuracy, and usefulness. The training data comes from the ReasonEdit-22K dataset of 22K edited images, 113K chain-of-thought samples, and 1.3M human judgments. On public benchmarks the resulting model aligns more closely with human preferences than prior scalar methods while also generating high-quality interpretable evaluation text.

What carries the argument

RE-Reward, an MLLM-based model that scores chain-of-thought reasoning chains on logicality, accuracy, and usefulness, whose signals then train ReasonEdit via the Group Relative Policy Optimization algorithm.
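The GRPO machinery named above can be illustrated by its core step, group-relative advantage normalization. This is a minimal generic sketch, not the paper's implementation; the reward values are hypothetical stand-ins for RE-Reward scores on a group of sampled evaluation texts.

```python
import statistics

def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantages as used in GRPO: each sampled response's
    reward is normalized by its group's mean and standard deviation."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# One editing case, G = 4 candidate evaluation texts; the rewards are
# hypothetical stand-ins for RE-Reward scores.
group_rewards = [0.8, 0.5, 0.9, 0.2]
advs = grpo_advantages(group_rewards)
# Texts scored above the group mean get positive advantage (reinforced);
# those below get negative advantage (suppressed).
```

Because advantages are computed within a group rather than against a learned value baseline, no critic network is needed, which is the practical appeal of GRPO for reward-model-driven training.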

If this is right

  • Evaluation of image edits gains readable supporting text instead of single scores.
  • The trained model generalizes to multiple public benchmarks while preserving human alignment.
  • Developers obtain specific, inspectable feedback on artifacts and unintended changes.
  • Assessment of text-guided editing becomes more transparent and therefore more actionable.
  • Future editing models can be iterated using the same interpretable signals.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same dataset-plus-reward-plus-GRPO pipeline could be applied to evaluation of other generative tasks such as video or 3D editing.
  • The generated reasoning text might be fed back into editing models to guide iterative refinement without additional human labels.
  • Systematic patterns in the model's explanations could expose recurring weaknesses in current text-guided editing techniques.

Load-bearing premise

The 1.3 million human judgments on logicality, accuracy, and usefulness collected for the ReasonEdit-22K dataset form a reliable and unbiased training signal that produces generalizable evaluators without circular dependence on the same ratings.
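One standard way to probe whether such human judgments form a reliable signal is inter-annotator agreement. Below is a minimal sketch using Cohen's kappa for two annotators on hypothetical 1–5 ratings; the paper does not specify which agreement statistic, if any, it reports.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two annotators: observed agreement corrected
    for the agreement expected by chance from each rater's marginals."""
    n = len(rater_a)
    p_obs = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    ca, cb = Counter(rater_a), Counter(rater_b)
    p_exp = sum(ca[k] * cb[k] for k in ca) / (n * n)
    return (p_obs - p_exp) / (1 - p_exp)

# Hypothetical 1-5 logicality ratings from two annotators on six CoT texts.
kappa = cohens_kappa([5, 4, 4, 3, 5, 2], [5, 4, 3, 3, 5, 2])
```

A kappa well above chance across the three rated dimensions would support the premise; near-zero agreement would undercut the entire pipeline.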

What would settle it

Collect fresh human ratings on the logicality, accuracy, and usefulness of evaluation text produced by ReasonEdit versus baseline methods on a large set of edited images never seen during training; if ReasonEdit text receives lower average ratings, the central claim fails.
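The proposed settling experiment amounts to a paired comparison of fresh ratings. A minimal sketch under assumed data: one per-image rating (averaged across logicality, accuracy, and usefulness) for each system on the same held-out images; all numbers are illustrative.

```python
import statistics

def compare_ratings(reasonedit_scores, baseline_scores):
    """Paired comparison of fresh human ratings (one averaged
    logicality/accuracy/usefulness score per held-out image)."""
    assert len(reasonedit_scores) == len(baseline_scores)
    diffs = [a - b for a, b in zip(reasonedit_scores, baseline_scores)]
    return {
        "mean_diff": statistics.mean(diffs),
        "wins": sum(d > 0 for d in diffs),
        "n": len(diffs),
    }

# Illustrative 1-5 ratings on five edited images never seen in training.
result = compare_ratings([4.2, 3.8, 4.5, 3.9, 4.1],
                         [3.7, 3.9, 4.0, 3.5, 3.8])
# A negative mean_diff on a large fresh sample would falsify the claim.
```

On a real sample the mean difference would be accompanied by a paired significance test; the sketch keeps only the decision-relevant quantities.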

Figures

Figures reproduced from arXiv: 2605.07477 by Guangtao Zhai, Honghua Chen, Huiyu Duan, Xinyun Zhang, Xiongkuo Min, Zitong Xu.

Figure 1
Figure 1. Dataflow during the construction of ReasonEdit-22K and its two subsets [PITH_FULL_IMAGE:figures/full_fig_p003_1.png] view at source ↗
Figure 2
Figure 2. Annotation workflow strategies to guide the MLLMs to generate structured CoT reasoning alongside three-dimensional sub-scores on visual quality, instruction alignment, and content preservation, as well as an overall quality score. view at source ↗
Figure 3
Figure 3. Overview of the RE-Reward architecture and SFT training [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗
Figure 4
Figure 4. Overview of the ReasonEdit architecture and SFT training [PITH_FULL_IMAGE:figures/full_fig_p006_4.png] view at source ↗
Figure 5
Figure 5. Task-type distribution in ReasonEdit-22K [PITH_FULL_IMAGE:figures/full_fig_p015_5.png] view at source ↗
Figure 6
Figure 6. Annotation interface for scoring candidate critiques [PITH_FULL_IMAGE:figures/full_fig_p016_6.png] view at source ↗
Figure 7
Figure 7. The raw 1–5 ordinal scores after the annotation of ReasonEdit-Reward-113K [PITH_FULL_IMAGE:figures/full_fig_p017_7.png] view at source ↗
Figure 8
Figure 8. Example where interpretable reasoning reveals a failure not captured by a single scalar [PITH_FULL_IMAGE:figures/full_fig_p024_8.png] view at source ↗
read the original abstract

Recent text-guided image editing (TIE) models have achieved remarkable progress, however, many edited results still suffer from artifacts, unintended modifications, and suboptimal aesthetics. Although several benchmarks and evaluation methods have been proposed, most existing approaches rely on scalar scores and lack interpretability. This limitation largely stems from the absence of high-quality interpretation datasets for TIE and effective reward models to train interpretable evaluators. To address these challenges, we introduce ReasonEdit-22K, the first dataset that combines 22K edited images with 113K Chain-of-Thought (CoT) samples, along with 1.3M human judgments assessing these interpretations in terms of logicality, accuracy, and usefulness. Building upon this dataset, we propose RE-Reward, a multimodal large language model (MLLM)-based reward model designed to provide human-aligned feedback for evaluating interpretable reasoning in image editing. Furthermore, we develop ReasonEdit, which is trained using reward signals derived from RE-Reward and the Group Relative Policy Optimization (GRPO) algorithm to learn an interpretable evaluation model. Extensive experiments demonstrate that ReasonEdit achieves superior alignment with human preferences and exhibits strong generalization across public benchmarks. In addition, it is capable of generating high-quality interpretable evaluation text, enabling more transparent and trustworthy assessment for image editing. The code is available at https://github.com/IntMeGroup/ReasonEdit.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces ReasonEdit-22K, a dataset of 22K edited images paired with 113K Chain-of-Thought samples and 1.3M human judgments on logicality, accuracy, and usefulness. It proposes RE-Reward, an MLLM-based reward model trained on these judgments, and ReasonEdit, an interpretable evaluator trained via Group Relative Policy Optimization (GRPO) using reward signals from RE-Reward. The central claim is that ReasonEdit achieves superior alignment with human preferences, strong generalization on public benchmarks, and generates high-quality interpretable evaluation text for text-guided image editing.

Significance. If the empirical claims hold after addressing the training-loop concerns, the work would supply a new human-annotated resource and a reward-modeling pipeline for producing interpretable rather than scalar evaluations in image editing. The dataset size and the use of GRPO for policy optimization are concrete contributions that could be reused by the community, but the absence of reported metrics, baselines, or ablation results in the provided abstract limits any assessment of practical impact.

major comments (2)
  1. [Abstract] Abstract: the claim of 'superior alignment with human preferences and strong generalization across public benchmarks' is stated without any quantitative metrics, baseline comparisons, ablation studies, or numerical results. This omission makes it impossible to determine whether the data support the central claim.
  2. [Abstract and implied methodology] Dataset construction and training loop (implied in Abstract): RE-Reward is fit directly to the 1.3M human judgments collected on ReasonEdit-22K CoT samples, after which the same ReasonEdit-22K data is used to generate RE-Reward scores for GRPO training of ReasonEdit. No train/test split for the reward model, inter-annotator agreement statistics, or independent validation set is mentioned, raising a load-bearing risk that reported gains reflect reward hacking or data leakage rather than genuine interpretability.
minor comments (1)
  1. [Abstract] The abstract refers to 'extensive experiments' yet supplies no summary statistics or figures; a one-sentence results highlight would improve readability.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and have prepared revisions to strengthen the manuscript.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the claim of 'superior alignment with human preferences and strong generalization across public benchmarks' is stated without any quantitative metrics, baseline comparisons, ablation studies, or numerical results. This omission makes it impossible to determine whether the data support the central claim.

    Authors: We agree that the abstract would be strengthened by including key quantitative results. In the revised manuscript we will add specific metrics (e.g., human alignment correlations and benchmark generalization scores) together with brief baseline comparisons drawn from the experiments section. This change will make the central claims directly verifiable from the abstract. revision: yes

  2. Referee: [Abstract and implied methodology] Dataset construction and training loop (implied in Abstract): RE-Reward is fit directly to the 1.3M human judgments collected on ReasonEdit-22K CoT samples, after which the same ReasonEdit-22K data is used to generate RE-Reward scores for GRPO training of ReasonEdit. No train/test split for the reward model, inter-annotator agreement statistics, or independent validation set is mentioned, raising a load-bearing risk that reported gains reflect reward hacking or data leakage rather than genuine interpretability.

    Authors: We acknowledge this methodological concern. We will revise the manuscript to explicitly document the data partitioning: RE-Reward was trained on a held-out subset of the 1.3M judgments with a separate validation split; inter-annotator agreement statistics will be reported; and the CoT samples used for GRPO were drawn from a disjoint portion of ReasonEdit-22K. These clarifications will demonstrate that the training procedure avoids leakage and that performance gains reflect genuine alignment. revision: yes
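The partitioning the rebuttal describes can be enforced mechanically by splitting at the image level, so that every judgment tied to one edited image lands in exactly one role. A generic leakage-guard sketch, not the authors' documented protocol; split names and fractions are assumptions.

```python
import hashlib

def assign_split(image_id: str, frac_reward=0.6, frac_val=0.1):
    """Deterministic, disjoint partitioning by image ID: all CoT samples
    and human judgments for one edited image fall into exactly one of
    reward-model training, reward-model validation, or GRPO rollouts."""
    # Hash the ID into [0, 1) so the assignment is stable across runs.
    h = int(hashlib.sha256(image_id.encode()).hexdigest(), 16) % 1000 / 1000
    if h < frac_reward:
        return "reward_train"
    if h < frac_reward + frac_val:
        return "reward_val"
    return "grpo"

splits = {assign_split(f"img_{i:05d}") for i in range(1000)}
# Grouping by image prevents the same edit from informing both the
# reward model and the policy it later scores.
```

Splitting by image rather than by CoT sample is the key design choice: multiple critiques of the same edit are highly correlated, so a sample-level split would still leak.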

Circularity Check

0 steps flagged

No significant circularity; derivation relies on independent human annotations and external benchmarks

full rationale

The paper constructs ReasonEdit-22K with 1.3M external human judgments on logicality/accuracy/usefulness as the primary training signal. RE-Reward is fit to these judgments, after which ReasonEdit is optimized via GRPO using RE-Reward scores. The central claims (human alignment and generalization) are evaluated on public benchmarks that lie outside the training distribution. No equation or claim reduces a reported prediction to a quantity defined by the same fitted values; no self-citation supplies a uniqueness theorem or ansatz; the human signal is treated as an independent oracle rather than a self-generated loop. This is a standard reward-model-plus-RL pipeline with external validation, hence free of self-referential circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The central claim rests on the quality and representativeness of the newly collected human judgments and the effectiveness of the RE-Reward + GRPO pipeline; limited information is available from the abstract alone.

axioms (1)
  • domain assumption Human judgments collected on logicality, accuracy, and usefulness of CoT samples provide a reliable training signal for reward models and downstream evaluators.
    The entire pipeline depends on the 1.3M human judgments described in the abstract.
invented entities (2)
  • RE-Reward no independent evidence
    purpose: MLLM-based reward model that scores interpretable reasoning in image edits
    New component introduced to generate human-aligned feedback.
  • ReasonEdit no independent evidence
    purpose: RL-trained model that produces interpretable evaluation text for image edits
    New model trained using GRPO and signals from RE-Reward.

pith-pipeline@v0.9.0 · 5564 in / 1401 out tokens · 59017 ms · 2026-05-11T01:47:39.678017+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

77 extracted references · 77 canonical work pages · 8 internal anchors

  1. [1]

    Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond

    Jinze Bai et al. Qwen-vl: A frontier large vision-language model with versatile abilities.arXiv preprint arXiv:2308.12966, 2023

  2. [2]

    Meteor: An automatic metric for mt evaluation with improved correlation with human judgments

    Satanjeev Banerjee and Alon Lavie. Meteor: An automatic metric for mt evaluation with improved correlation with human judgments. InACL Workshop, 2005

  3. [3]

    Instructpix2pix: Learning to follow image editing instructions

    Tim Brooks, Aleksander Holynski, and Alexei A Efros. Instructpix2pix: Learning to follow image editing instructions. InCVPR, 2023

  4. [4]

Topiq: A transformed-order prioritized image quality assessment

    B. Chen et al. Topiq: A transformed-order prioritized image quality assessment.arXiv preprint arXiv:2308.XXXXX, 2023

  5. [5]

    Internvl2: Better and faster vision-language understanding.arXiv preprint arXiv:2407.XXXXX, 2024

    Zhe Chen et al. Internvl2: Better and faster vision-language understanding.arXiv preprint arXiv:2407.XXXXX, 2024

  6. [6]

    Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

    Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities.arXiv preprint arXiv:2507.06261, 2025

  7. [7]

    Gemma: Open Models Based on Gemini Research and Technology

    Google DeepMind. Gemma: Open models based on gemini research and technology.arXiv preprint arXiv:2403.08295, 2024

  8. [8]

    Qlora: Efficient finetuning of quantized llms

    Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. Qlora: Efficient finetuning of quantized llms. InNeurIPS, 2023

  9. [9]

    Finevq: Fine-grained user generated content video quality assessment

    Huiyu Duan, Qiang Hu, Jiarui Wang, Liu Yang, Zitong Xu, Lu Liu, Xiongkuo Min, Chunlei Cai, Tianxiao Ye, Xiaoyun Zhang, and Guangtao Zhai. Finevq: Fine-grained user generated content video quality assessment. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025

  10. [10]

    Scaling rectified flow transformers for high-resolution image synthesis

    Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, et al. Scaling rectified flow transformers for high-resolution image synthesis. InProceedings of the International Conference on Machine Learning (ICML), 2024

  11. [11]

    Simcse: Simple contrastive learning of sentence embeddings

    Tianyu Gao, Xingcheng Yao, and Danqi Chen. Simcse: Simple contrastive learning of sentence embeddings. InEMNLP, 2021

  12. [12]

    Gemini 2.0: A next-generation multimodal model.Technical Report, 2024

    Google. Gemini 2.0: A next-generation multimodal model.Technical Report, 2024

  13. [13]

    Gemini 3.1 pro: Best for complex tasks and bringing creative concepts to life

    Google DeepMind. Gemini 3.1 pro: Best for complex tasks and bringing creative concepts to life. https://deepmind.google/models/gemini/pro/, 2025

  14. [14]

    UniREditBench: A unified reasoning-based image editing benchmark.arXiv preprint arXiv:2511.01295, 2025

    Feng Han, Yibin Wang, Chenglin Li, Zheming Liang, Dianyi Wang, Yang Jiao, Zhipeng Wei, Chao Gong, Cheng Jin, Jingjing Chen, et al. Unireditbench: A unified reasoning-based image editing benchmark.arXiv preprint arXiv:2511.01295, 2025

  15. [15]

    CLIPScore: A Reference-free Evaluation Metric for Image Captioning

    Jack Hessel et al. Clipscore: A reference-free evaluation metric for image captioning.arXiv preprint arXiv:2104.08718, 2021

  16. [16]

Instantstyle: Free lunch towards style-preserving in text-to-image generation. arXiv preprint arXiv:2404.02733, 2024

    Shengding Hu et al. Minicpm-v: A gpt-4v level mllm on your phone.arXiv preprint arXiv:2404.02733, 2024

  17. [17]

Genai-bench: A comprehensive benchmark for generative ai

Y. Jiang et al. Genai-bench: A comprehensive benchmark for generative ai. arXiv preprint arXiv:2406.XXXXX, 2024

  18. [18]

Pick-a-pic: An open dataset of user preferences for text-to-image generation

Y. Kirstain et al. Pick-a-pic: An open dataset of user preferences for text-to-image generation. NeurIPS, 2023

  19. [19]

Pick-a-pic: An open dataset of user preferences for text-to-image generation. NeurIPS, 2023

    Yuval Kirstain et al. Pick-a-pic: An open dataset of user preferences for text-to-image generation. NeurIPS, 2023

  20. [20]

Editreward: A human-aligned reward model for instruction-guided image editing

    Benno Krojer et al. Editreward: A human-aligned reward model for instruction-guided image editing. arXiv preprint arXiv:2509.26346, 2025

  21. [21]

    Learning action and reasoning-centric image editing from videos and simulation

Benno Krojer, Dheeraj Vattikonda, et al. Learning action and reasoning-centric image editing from videos and simulation. In NeurIPS, 2024

  22. [22]

    Fleur: An explainable reference-free evaluation metric for image captioning using a large multimodal model

    Yebin Lee, Imseong Park, and Myungjoo Kang. Fleur: An explainable reference-free evaluation metric for image captioning using a large multimodal model. InProceedings of the Association for Computational Linguistics (ACL), pages 3732–3746, 2024

  23. [23]

    Rouge: A package for automatic evaluation of summaries

    Chin-Yew Lin. Rouge: A package for automatic evaluation of summaries. InACL Workshop, 2004

  24. [24]

Vqa-score: Evaluating image-to-text generation via question answering

    Z. Lin et al. Vqa-score: Evaluating image-to-text generation via question answering.arXiv preprint arXiv:2403.XXXXX, 2024

  25. [25]

    Evaluating text-to-visual generation with image-to-text generation

    Z. Lin et al. Vqascore: Evaluating text-to-image generation with visual question answering.arXiv preprint arXiv:2404.01291, 2024

  26. [26]

    Unlocking the essence of beauty: Advanced aesthetic reasoning with relative-absolute policy optimization

    Boyang Liu, Yifan Hu, Senjie Jin, Shihan Dou, Gonglei Shi, Jie Shao, Tao Gui, and Xuanjing Huang. Unlocking the essence of beauty: Advanced aesthetic reasoning with relative-absolute policy optimization. arXiv preprint arXiv:2509.21871, 2025

  27. [27]

    Visual instruction tuning

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. InNeurIPS, 2023

  28. [28]

    Deepseek-vl2: Mixture-of-experts vision-language models.arXiv preprint arXiv:2412.XXXXX, 2024

    Haoyu Lu et al. Deepseek-vl2: Mixture-of-experts vision-language models.arXiv preprint arXiv:2412.XXXXX, 2024

  29. [29]

    Ovis2.5 technical report, 2025

    Shiyin Lu, Yang Li, Yu Xia, Yuwei Hu, Shanshan Zhao, Yanqing Ma, et al. Ovis2.5 technical report. arXiv:2508.11737, 2025

  30. [30]

Ahiq: Attentive human-centric image quality assessment

    S. Luo et al. Ahiq: Attentive human-centric image quality assessment.arXiv preprint arXiv:2305.XXXXX, 2023

  31. [31]

    Distributed representations of words and phrases and their compositionality

    Tomas Mikolov et al. Distributed representations of words and phrases and their compositionality. In NeurIPS, 2013

  32. [32]

    No-reference image quality assessment in the spatial domain.IEEE TIP, 21(12):4695–4708, 2012

    Anish Mittal, Anush Krishna Moorthy, and Alan C Bovik. No-reference image quality assessment in the spatial domain.IEEE TIP, 21(12):4695–4708, 2012

  33. [33]

Making a "completely blind" image quality analyzer

    Anish Mittal, Rajiv Soundararajan, and Alan C Bovik. Making a "completely blind" image quality analyzer. IEEE Signal Processing Letters, 20(3):209–212, 2012

  34. [34]

Interpretable reward models via decomposable attribution

    S. Mo et al. Interpretable reward models via decomposable attribution.arXiv preprint arXiv:2501.01234, 2025

  35. [35]

    Blind image quality assessment: From natural scene statistics to perceptual quality.IEEE TIP, 20(12):3350–3364, 2011

    Anush Krishna Moorthy and Alan C Bovik. Blind image quality assessment: From natural scene statistics to perceptual quality.IEEE TIP, 20(12):3350–3364, 2011

  36. [36]

    Hello gpt-4o.https://openai.com/index/hello-gpt-4o/, 2024

    OpenAI. Hello gpt-4o.https://openai.com/index/hello-gpt-4o/, 2024

  37. [37]

    Gpt-5 technical report.Technical Report, 2025

    OpenAI. Gpt-5 technical report.Technical Report, 2025

  38. [38]

    Gpt image 1: State-of-the-art image generation model, 2025

    OpenAI. Gpt image 1: State-of-the-art image generation model, 2025. https://platform.openai. com/docs/models/gpt-image-1

  39. [39]

    Internvl 3.5: Open-source vision-language model.arXiv preprint, 2025

    OpenGVLab. Internvl 3.5: Open-source vision-language model.arXiv preprint, 2025

  40. [40]

Imagenhub: Standardizing the evaluation of conditional image generation

    S. Peng et al. Imagenhub: Standardizing the evaluation of conditional image generation.arXiv preprint arXiv:2310.XXXXX, 2023

  41. [41]

    High-resolution image synthesis with latent diffusion models

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10684–10695, 2022

  42. [42]

    Positive-augmented contrastive learning for image and video captioning evaluation

    Sara Sarto, Manuele Barraco, Marcella Cornia, Lorenzo Baraldi, and Rita Cucchiara. Positive-augmented contrastive learning for image and video captioning evaluation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 6914–6924, 2023

  43. [43]

    Seedream 4.0: Toward next-generation multimodal image generation.arXiv preprint, 2025

    Team Seedream, Yunpeng Chen, Yu Gao, Lixue Gong, Meng Guo, Qiushan Guo, et al. Seedream 4.0: Toward next-generation multimodal image generation.arXiv preprint, 2025

  44. [44]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y. K. Li, Y. Wu, and Daya Guo. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024

  45. [45]

    Qwen3 technical report.arXiv preprint, 2025

Alibaba Qwen Team. Qwen3 technical report. arXiv preprint, 2025

  46. [46]

    Qwen3.5-omni technical report

    Qwen Team. Qwen3.5-omni technical report

  47. [47]

    Editscore: Unlocking online rl for image editing via high-fidelity reward modeling

    VectorSpaceLab. Editscore: Unlocking online rl for image editing via high-fidelity reward modeling. In ICLR, 2026

  48. [48]

    Creval: An automated interpretable evaluation for creative image manipulation under complex instructions.arXiv preprint arXiv:2603.26174, 2026

    Chonghuinan Wang, Zihan Chen, Yuxiang Wei, Tianyi Jiang, Xiaohe Wu, Fan Li, Wangmeng Zuo, and Hongxun Yao. Creval: An automated interpretable evaluation for creative image manipulation under complex instructions.arXiv preprint arXiv:2603.26174, 2026

  49. [49]

    Lmm4lmm: Benchmarking and evaluating large-multimodal image generation with lmms

    Jiarui Wang, Huiyu Duan, Yu Zhao, Juntong Wang, Guangtao Zhai, and Xiongkuo Min. Lmm4lmm: Benchmarking and evaluating large-multimodal image generation with lmms. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 17312–17323, 2025

  50. [50]

    Image quality assessment: from error visibility to structural similarity.IEEE TIP, 13(4):600–612, 2004

    Zhou Wang, Alan C Bovik, Hamid R Sheikh, and Eero P Simoncelli. Image quality assessment: from error visibility to structural similarity.IEEE TIP, 13(4):600–612, 2004

  51. [51]

    Qwen-Image Technical Report

    Chenfei Wu, Jiahao Li, Jingren Zhou, Junyang Lin, Kaiyuan Gao, Kun Yan, et al. Qwen-image technical report.arXiv preprint arXiv:2508.02324, 2025

  52. [52]

    OmniGen2: Towards Instruction-Aligned Multimodal Generation

    Chenyuan Wu, Pengfei Zheng, Ruiran Yan, Shitao Xiao, Xin Luo, Yueze Wang, et al. Omnigen2: Exploration to advanced multimodal generation.arXiv preprint arXiv:2506.18871, 2025

  53. [53]

    Visualquality-r1: Reasoning-induced image quality assessment via reinforcement learning to rank.arXiv preprint arXiv:2505.14460, 2025

    Tianhe Wu, Jian Zou, Jie Liang, Lei Zhang, and Kede Ma. VisualQuality-R1: Reasoning-induced image quality assessment via reinforcement learning to rank.arXiv preprint arXiv:2505.14460, 2025

  54. [54]

Dreamomni2: Multimodal instruction-based editing and generation

    Bin Xia, Bohao Peng, Yuechen Zhang, Junjia Huang, Jiyang Liu, Jingyao Li, et al. Dreamomni2: Multimodal instruction-based editing and generation.arXiv preprint arXiv:2510.06679, 2025

  55. [55]

    Imagereward: Learning and evaluating human preferences for text-to-image generation

Jiazheng Xu et al. Imagereward: Learning and evaluating human preferences for text-to-image generation. NeurIPS, 2023

  56. [56]

    Imagereward: Learning and evaluating human preferences for text-to-image generation

    Jiazheng Xu et al. Imagereward: Learning and evaluating human preferences for text-to-image generation. NeurIPS, 2023

  57. [57]

Edithf-1m: A million-scale rich human preference feedback for image editing. arXiv preprint arXiv:2603.14916, 2026

    Zitong Xu, Huiyu Duan, Zhongpeng Ji, Xinyun Zhang, Yutao Liu, Xiongkuo Min, et al. Edithf-1m: A million-scale rich human preference feedback for image editing.arXiv preprint arXiv:2603.14916, 2026

  58. [58]

    Harmonyiqa: Pioneering benchmark and model for image harmonization quality assessment

    Zitong Xu, Huiyu Duan, Guangji Ma, Liu Yang, Jiarui Wang, Qingbo Wu, et al. Harmonyiqa: Pioneering benchmark and model for image harmonization quality assessment. InIEEE International Conference on Multimedia and Expo (ICME), pages 1–6, 2025

  59. [59]

    Lmm4edit: Benchmarking and evaluating multimodal image editing with lmms.arXiv preprint arXiv:2507.16193, 2025

    Zitong Xu et al. Lmm4edit: Benchmarking and evaluating multimodal image editing with lmms.arXiv preprint arXiv:2507.16193, 2025

  60. [60]

    Gradient magnitude similarity deviation: A highly efficient perceptual image quality index.IEEE TIP, 23(2):684–695, 2013

    Wufeng Xue, Lei Zhang, Xuanqin Mou, and Alan C Bovik. Gradient magnitude similarity deviation: A highly efficient perceptual image quality index.IEEE TIP, 23(2):684–695, 2013

  61. [61]

    Maniqa: Multi-dimension attention network for no-reference image quality assessment

    Pengfei Yang et al. Maniqa: Multi-dimension attention network for no-reference image quality assessment. InCVPR Workshops, 2022

  62. [62]

    Image quality assessment based on the perceived structural similarity index of an image.Mathematical Biosciences and Engineering, 20(5):9385–9409, 2023

    Juncai Yao, Jing Shen, and Congying Yao. Image quality assessment based on the perceived structural similarity index of an image.Mathematical Biosciences and Engineering, 20(5):9385–9409, 2023

  63. [63]

    mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality

Qinghao Ye et al. mplug-owl: Modularization empowers large language models with multimodality. arXiv preprint arXiv:2304.14178, 2023

  64. [64]

    Content-variant reference image quality assessment via knowledge distillation

    Guanghao Yin, Wei Wang, Zehuan Yuan, et al. Content-variant reference image quality assessment via knowledge distillation. InAAAI, volume 36, pages 3134–3142, 2022

  65. [65]

    Magicbrush: A large-scale dataset for instruction-guided real image editing.NeurIPS, 2024

    Kai Zhang et al. Magicbrush: A large-scale dataset for instruction-guided real image editing.NeurIPS, 2024

  66. [66]

    LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention

    Renrui Zhang et al. Llama-adapter: Efficient fine-tuning of language models with zero-init attention.arXiv preprint arXiv:2303.16199, 2023

  67. [67]

    The unreasonable effectiveness of deep features as a perceptual metric

    Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. InCVPR, pages 586–595, 2018

  68. [68]

Q-align: Teaching lmms for visual scoring via language-to-score alignment

Wu Zhang et al. Q-align: Teaching lmms for visual scoring via language-to-score alignment. arXiv preprint arXiv:2312.17090, 2023

  69. [69]

    Critique-llm: Scaling feedback generation for large language models.arXiv preprint arXiv:2405.00123, 2024

    Chujie Zheng et al. Critique-llm: Scaling feedback generation for large language models.arXiv preprint arXiv:2405.00123, 2024

  70. [70]

    InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models

Jinguo Zhu, Weiyun Wang, Zhe Chen, Zhaoyang Liu, Shenglong Ye, Lixin Gu, et al. Internvl3: Exploring advanced training and test-time recipes for open-source multimodal models. arXiv preprint arXiv:2504.10479, 2025

  71. [73]

    Reward Scores

    Content Preservation (Content consistency) E.g., consistency of the main structure with the original, preservation of unedited areas, style consistency. [Final Assessment] After outputting [Final Assessment], immediately continue with exactly three scores for Vi- sual Quality, Editing Alignment, and Content Preservation in one line, separated by commas, w...

  72. [74]

    logicality: internal consistency, coherent reasoning, and absence of contradictions

  73. [75]

    accuracy: factual alignment with the source image, edited image, and editing instruction

  74. [76]

    up_proj",

    usefulness: specificity, diagnostic value, and usefulness for reward modeling. Summarize the grounded evidence into the final anchor token sequence for regression. Reward Scores: D Details of ReasonEdit D.1 Dual-head model architecture ReasonEdit is a multimodal generator-regressor for interpretable TIE evaluation. It takes the source image, edited image,...

  75. [77]

    Visual Quality (Naturalness of the edit and image) E.g., lighting, clarity, color, details, realism, etc

  76. [78]

    Editing Alignment (Adherence to editing instructions) Whether the instruction is fully or partially implemented, and the effectiveness of the imple- mentation

  77. [79]

    logicality

    Content Preservation (Content consistency) E.g., consistency of the main structure with the original, preservation of unedited areas, style consistency. [Final Assessment] After outputting [Final Assessment], immediately continue with exactly three scores for Vi- sual Quality, Editing Alignment, and Content Preservation in one line, separated by commas, w...