ReasonEdit: Towards Interpretable Image Editing Evaluation via Reinforcement Learning
Recognition: 2 Lean theorem links
Pith reviewed 2026-05-11 01:47 UTC · model grok-4.3
The pith
ReasonEdit trains an interpretable evaluator for text-guided image editing using reinforcement learning on human judgments of reasoning quality.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
ReasonEdit is an evaluation model for text-guided image editing, trained with the Group Relative Policy Optimization (GRPO) algorithm on reward signals from RE-Reward, a multimodal large language model that scores chain-of-thought interpretations against human ratings of logicality, accuracy, and usefulness. The training data comes from the ReasonEdit-22K dataset of 22K edited images, 113K chain-of-thought samples, and 1.3M human judgments. On public benchmarks the resulting model aligns more closely with human preferences than prior scalar methods while also generating high-quality, interpretable evaluation text.
What carries the argument
RE-Reward, an MLLM-based reward model that scores chain-of-thought reasoning on logicality, accuracy, and usefulness; its signals then train ReasonEdit via the GRPO algorithm.
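For readers unfamiliar with GRPO, here is a minimal sketch of its group-relative update, following the DeepSeekMath formulation; the tensor shapes and the `clip_eps` and `beta` values are illustrative assumptions, not the paper's settings. In this pipeline, RE-Reward would supply `rewards` by scoring G sampled chains of thought per edited image.

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Standardize rewards within each group of G sampled CoT evaluations
    for the same (source image, edited image, instruction) triple.
    rewards: [B, G] scores from the reward model; assumes G > 1."""
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)

def grpo_loss(logp_new: torch.Tensor, logp_old: torch.Tensor,
              advantages: torch.Tensor, kl: torch.Tensor,
              clip_eps: float = 0.2, beta: float = 0.04) -> torch.Tensor:
    """Clipped surrogate objective with a KL penalty; all inputs are [B, G]."""
    ratio = torch.exp(logp_new - logp_old)
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    surrogate = torch.min(ratio * advantages, clipped * advantages)
    return -(surrogate - beta * kl).mean()
```

Because advantages are normalized within each group, GRPO needs no learned value function; the reward model alone drives the update.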
If this is right
- Evaluation of image edits gains readable supporting text instead of single scores.
- The trained model generalizes to multiple public benchmarks while preserving human alignment.
- Developers obtain specific, inspectable feedback on artifacts and unintended changes.
- Assessment of text-guided editing becomes more transparent and therefore more actionable.
- Future editing models can be iterated using the same interpretable signals.
Where Pith is reading between the lines
- The same dataset-plus-reward-plus-GRPO pipeline could be applied to evaluation of other generative tasks such as video or 3D editing.
- The generated reasoning text might be fed back into editing models to guide iterative refinement without additional human labels.
- Systematic patterns in the model's explanations could expose recurring weaknesses in current text-guided editing techniques.
Load-bearing premise
The 1.3 million human judgments on logicality, accuracy, and usefulness collected for the ReasonEdit-22K dataset form a reliable and unbiased training signal that produces generalizable evaluators without circular dependence on the same ratings.
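Whether this premise holds is at least partly checkable with standard agreement statistics over the raw judgments. A minimal sketch of Fleiss' kappa, assuming each CoT sample receives the same number of categorical ratings per dimension (the dataset's actual rating scale and annotator count are not stated here):

```python
import numpy as np

def fleiss_kappa(counts: np.ndarray) -> float:
    """Inter-annotator agreement over categorical ratings.
    counts[i, k] = number of annotators assigning item i to category k;
    every item is assumed to have the same number of annotators n."""
    n = counts.sum(axis=1)[0]
    p_cat = counts.sum(axis=0) / counts.sum()              # category marginals
    p_item = ((counts ** 2).sum(axis=1) - n) / (n * (n - 1))
    p_bar, p_exp = p_item.mean(), (p_cat ** 2).sum()
    return (p_bar - p_exp) / (1 - p_exp)
```

Low kappa on any of the three dimensions would undermine the premise before any model training begins.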
What would settle it
Collect fresh human ratings on the logicality, accuracy, and usefulness of evaluation text produced by ReasonEdit versus baseline methods on a large set of edited images never seen during training; if ReasonEdit text receives lower average ratings, the central claim fails.
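The proposed test reduces to a standard alignment measurement. A minimal sketch, assuming scalar model scores and per-image mean human ratings are available; SRCC and PLCC are the usual alignment statistics in quality assessment, though the paper's exact metrics are not stated in the abstract:

```python
import numpy as np
from scipy import stats

def alignment(model_scores, human_scores):
    """Rank (SRCC) and linear (PLCC) correlation between a model's
    scores and mean human ratings on unseen edited images."""
    srcc, _ = stats.spearmanr(model_scores, human_scores)
    plcc, _ = stats.pearsonr(model_scores, human_scores)
    return {"SRCC": srcc, "PLCC": plcc}

# Hypothetical comparison: the central claim fails if ReasonEdit's
# correlations fall below a scalar baseline's on the held-out set.
# alignment(reasonedit_scores, fresh_ratings)
# alignment(baseline_scores, fresh_ratings)
```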
Original abstract
Recent text-guided image editing (TIE) models have achieved remarkable progress, however, many edited results still suffer from artifacts, unintended modifications, and suboptimal aesthetics. Although several benchmarks and evaluation methods have been proposed, most existing approaches rely on scalar scores and lack interpretability. This limitation largely stems from the absence of high-quality interpretation datasets for TIE and effective reward models to train interpretable evaluators. To address these challenges, we introduce ReasonEdit-22K, the first dataset that combines 22K edited images with 113K Chain-of-Thought (CoT) samples, along with 1.3M human judgments assessing these interpretations in terms of logicality, accuracy, and usefulness. Building upon this dataset, we propose RE-Reward, a multimodal large language model (MLLM)-based reward model designed to provide human-aligned feedback for evaluating interpretable reasoning in image editing. Furthermore, we develop ReasonEdit, which is trained using reward signals derived from RE-Reward and the Group Relative Policy Optimization (GRPO) algorithm to learn an interpretable evaluation model. Extensive experiments demonstrate that ReasonEdit achieves superior alignment with human preferences and exhibits strong generalization across public benchmarks. In addition, it is capable of generating high-quality interpretable evaluation text, enabling more transparent and trustworthy assessment for image editing. The code is available at https://github.com/IntMeGroup/ReasonEdit.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces ReasonEdit-22K, a dataset of 22K edited images paired with 113K Chain-of-Thought samples and 1.3M human judgments on logicality, accuracy, and usefulness. It proposes RE-Reward, an MLLM-based reward model trained on these judgments, and ReasonEdit, an interpretable evaluator trained via Group Relative Policy Optimization (GRPO) using reward signals from RE-Reward. The central claim is that ReasonEdit achieves superior alignment with human preferences, strong generalization on public benchmarks, and generates high-quality interpretable evaluation text for text-guided image editing.
Significance. If the empirical claims hold after addressing the training-loop concerns, the work would supply a new human-annotated resource and a reward-modeling pipeline for producing interpretable rather than scalar evaluations in image editing. The dataset size and the use of GRPO for policy optimization are concrete contributions that could be reused by the community, but the absence of reported metrics, baselines, or ablation results in the provided abstract limits any assessment of practical impact.
major comments (2)
- [Abstract] The claim of 'superior alignment with human preferences and strong generalization across public benchmarks' is stated without quantitative metrics, baseline comparisons, ablation studies, or numerical results, making it impossible to determine whether the data support the central claim.
- [Abstract, implied methodology] RE-Reward is fit directly to the 1.3M human judgments collected on ReasonEdit-22K CoT samples, after which the same ReasonEdit-22K data is used to generate RE-Reward scores for GRPO training of ReasonEdit. No train/test split for the reward model, inter-annotator agreement statistics, or independent validation set is mentioned, raising a load-bearing risk that reported gains reflect reward hacking or data leakage rather than genuine interpretability.
minor comments (1)
- [Abstract] The abstract refers to 'extensive experiments' yet supplies no summary statistics or figures; a one-sentence results highlight would improve readability.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment below and have prepared revisions to strengthen the manuscript.
Point-by-point responses
- Referee: [Abstract] The claim of 'superior alignment with human preferences and strong generalization across public benchmarks' is stated without quantitative metrics, baseline comparisons, ablation studies, or numerical results, making it impossible to determine whether the data support the central claim.
  Authors: We agree that the abstract would be strengthened by including key quantitative results. In the revised manuscript we will add specific metrics (e.g., human alignment correlations and benchmark generalization scores) together with brief baseline comparisons drawn from the experiments section. This change will make the central claims directly verifiable from the abstract. Revision: yes.
- Referee: [Abstract, implied methodology] RE-Reward is fit directly to the 1.3M human judgments collected on ReasonEdit-22K CoT samples, after which the same ReasonEdit-22K data is used to generate RE-Reward scores for GRPO training of ReasonEdit. No train/test split for the reward model, inter-annotator agreement statistics, or independent validation set is mentioned, raising a load-bearing risk that reported gains reflect reward hacking or data leakage rather than genuine interpretability.
  Authors: We acknowledge this methodological concern. We will revise the manuscript to explicitly document the data partitioning: RE-Reward was trained on a held-out subset of the 1.3M judgments with a separate validation split; inter-annotator agreement statistics will be reported; and the CoT samples used for GRPO were drawn from a disjoint portion of ReasonEdit-22K. These clarifications will demonstrate that the training procedure avoids leakage and that performance gains reflect genuine alignment. Revision: yes.
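To make the rebuttal's partition concrete: a hypothetical splitter that separates reward-model training from GRPO rollouts at the image level, so no edited image contributes CoT samples to both stages. The function name and split fractions are illustrative assumptions, not the authors' documented protocol.

```python
import random

def disjoint_splits(image_ids, seed=0, frac_reward=0.5, frac_val=0.1):
    """Partition edited-image IDs so the reward model (RE-Reward) and
    the GRPO policy (ReasonEdit) never train on the same images.
    Splitting by image ID rather than by CoT sample prevents leakage,
    since each image has several CoT samples."""
    ids = sorted(image_ids)
    random.Random(seed).shuffle(ids)
    n_reward = int(frac_reward * len(ids))
    n_val = int(frac_val * len(ids))
    return {
        "reward_train": set(ids[:n_reward]),
        "reward_val": set(ids[n_reward:n_reward + n_val]),
        "grpo_train": set(ids[n_reward + n_val:]),
    }
```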
Circularity Check
No significant circularity; the derivation relies on independent human annotations and external benchmarks.
Full rationale
The paper constructs ReasonEdit-22K with 1.3M external human judgments on logicality/accuracy/usefulness as the primary training signal. RE-Reward is fit to these judgments, after which ReasonEdit is optimized via GRPO using RE-Reward scores. The central claims (human alignment and generalization) are evaluated on public benchmarks that lie outside the training distribution. No equation or claim reduces a reported prediction to a quantity defined by the same fitted values; no self-citation supplies a uniqueness theorem or ansatz; the human signal is treated as an independent oracle rather than a self-generated loop. This is a standard reward-model-plus-RL pipeline with external validation, hence self-contained.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Human judgments collected on the logicality, accuracy, and usefulness of CoT samples provide a reliable training signal for reward models and downstream evaluators.
invented entities (2)
- RE-Reward: no independent evidence
- ReasonEdit: no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · tag: unclear
  Relation between the paper passage and the cited Recognition theorem is unclear.
  Passage: "RE-Reward... trained using reward signals derived from RE-Reward and the Group Relative Policy Optimization (GRPO) algorithm... 1.3M human judgments assessing these interpretations in terms of logicality, accuracy, and usefulness"
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · tag: unclear
  Relation between the paper passage and the cited Recognition theorem is unclear.
  Passage: "ReasonEdit... trained via supervised fine-tuning (SFT) and Group Relative Policy Optimization (GRPO)"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.