pith. machine review for the scientific record.

arxiv: 2604.19406 · v1 · submitted 2026-04-21 · 💻 cs.CV · cs.AI

Recognition: unknown

HP-Edit: A Human-Preference Post-Training Framework for Image Editing

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 03:21 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords image editing · human preference alignment · post-training · diffusion models · reinforcement learning from human feedback · automatic evaluator · preference dataset

The pith

A scorer trained on small human preference data enables scalable post-training of image editing models to better match human tastes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces HP-Edit to apply human feedback to image editing models more efficiently. It creates an automatic scorer called HP-Scorer using a small set of human ratings and a pretrained visual language model. This scorer then helps build a large preference dataset and acts as the reward signal for training the editing model. The result is that models produce edits that people prefer more in everyday tasks. Readers would care because it reduces the need for huge amounts of human labeling while improving output quality.

Core claim

By training HP-Scorer on limited human-preference scoring data together with a pretrained VLM, the framework can automatically evaluate and score edited images according to human preferences. This allows efficient construction of the RealPref-50K dataset across eight editing tasks and serves as the reward function for post-training diffusion-based editing models such as Qwen-Image-Edit-2509, resulting in outputs that align more closely with human preference as shown in experiments on RealPref-Bench.
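The pipeline this claim describes — score candidate edits automatically, keep only clearly separated pairs — can be sketched as follows. `score_edit` is a toy stand-in for HP-Scorer (the paper's scorer is a fine-tuned VLM, not a length heuristic), and all names here are illustrative rather than the paper's API.

```python
def score_edit(source, instruction, edited):
    # Toy stand-in for HP-Scorer: the real scorer runs a preference-tuned
    # VLM over (source image, instruction, edited image). Here we simply
    # return the candidate's length so the pipeline is runnable.
    return float(len(edited))

def build_preference_pairs(samples, margin=0.5):
    """Scorer-driven preference-pair construction (illustrative).

    samples: iterable of (source, instruction, candidate_a, candidate_b).
    Pairs whose score gap is below `margin` are discarded rather than
    labeled, so ambiguous comparisons do not pollute the dataset.
    """
    pairs = []
    for source, instruction, cand_a, cand_b in samples:
        sa = score_edit(source, instruction, cand_a)
        sb = score_edit(source, instruction, cand_b)
        if abs(sa - sb) < margin:
            continue  # too close to call: skip instead of guessing
        chosen, rejected = (cand_a, cand_b) if sa > sb else (cand_b, cand_a)
        pairs.append({"source": source, "instruction": instruction,
                      "chosen": chosen, "rejected": rejected})
    return pairs
```

The margin filter is the key design choice: it trades dataset size for label reliability, which matters because every label here is machine-generated.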

What carries the argument

The HP-Scorer, an automatic human preference-aligned evaluator developed from small human data and a VLM to score editing results.
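As a concrete (toy) picture of why a small seed set can suffice: freeze a feature extractor standing in for the pretrained VLM and fit only a small regression head on a handful of ratings. The features, ratings, and ridge head below are synthetic placeholders, not the paper's architecture — the paper's scorer is a fine-tuned VLM, not a linear probe.

```python
import numpy as np

rng = np.random.default_rng(0)

# Frozen "VLM" embeddings of 40 rated (source, instruction, edit) triples.
dim = 16
X = rng.normal(size=(40, dim))

# Synthetic human ratings: a hidden linear preference plus rater noise.
true_w = rng.normal(size=dim)
y = X @ true_w + rng.normal(scale=0.1, size=40)

# Ridge-regularized least-squares head: w = (X^T X + lam*I)^{-1} X^T y.
lam = 1e-2
w = np.linalg.solve(X.T @ X + lam * np.eye(dim), X.T @ y)

def hp_score(features):
    """Scalar preference score for one edit's feature vector (illustrative)."""
    return float(features @ w)
```

With only 40 synthetic "ratings" the head already tracks the hidden preference closely; the point the framework relies on is that the heavy lifting is done by the pretrained backbone, not by the small seed set.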

If this is right

  • Editing models post-trained this way will generate results preferred by humans on common tasks like object editing.
  • Large-scale preference datasets can be created without proportional increases in human effort.
  • A dedicated benchmark RealPref-Bench allows standardized evaluation of real-world editing performance.
  • The gap in applying RLHF techniques to diffusion image editing is addressed through this scalable approach.
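The RLHF machinery being ported to editing here (Flow-GRPO-style) centers on a group-relative advantage: sample several edits per prompt, score each with the reward model, and normalize rewards within the group. A generic sketch of that normalization — the standard GRPO form, not code from the paper:

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: standardize rewards within one prompt's
    group of sampled edits, so the policy update pushes probability
    toward above-average candidates and away from below-average ones."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)
```

Because advantages are relative within a group, only the reward model's ranking of candidates matters — which is exactly why a preference-aligned scorer like HP-Scorer can slot in as the reward.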

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Similar scorers could be developed for other creative AI tasks like video editing or text-to-image generation to automate preference alignment.
  • Over time, the scorer could be updated with new human data to adapt to changing preferences or new editing styles.
  • This method might lower barriers for smaller teams to fine-tune advanced editing models without access to massive annotation resources.

Load-bearing premise

The HP-Scorer accurately captures unbiased human preferences for a wide range of editing tasks using only a small amount of initial data and a pretrained VLM.

What would settle it

A direct comparison where humans rate the edited outputs from the post-trained model lower than the base model on a held-out set of diverse editing prompts would falsify the effectiveness claim.
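That settling experiment reduces to a paired comparison. A sketch of the analysis, assuming per-prompt human ratings for both the post-trained and base models (names illustrative):

```python
from math import comb

def win_rate(paired_ratings):
    """paired_ratings: (post_trained_score, base_score) per prompt.
    Ties carry no preference information and are dropped."""
    decisive = [(p, b) for p, b in paired_ratings if p != b]
    wins = sum(1 for p, b in decisive if p > b)
    return (wins / len(decisive) if decisive else None), wins, len(decisive)

def sign_test_p(wins, n):
    """One-sided sign test: P(X >= wins) for X ~ Binomial(n, 0.5).
    A win rate at or below 0.5 on diverse held-out prompts is what
    would falsify the effectiveness claim."""
    return sum(comb(n, k) for k in range(wins, n + 1)) / 2 ** n
```

The sign test is deliberately assumption-light: it needs only which model a rater preferred per prompt, not calibrated scores.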

Figures

Figures reproduced from arXiv: 2604.19406 by Chonghuinan Wang, Fan Li, Fenglong Song, Jiaqi Xu, Jiaxiu Jiang, Lina Lei, Renjing Pei, Wangmeng Zuo, Xinran Qin, Yuping Qiu, Zhikai Chen, Zhixin Wang.

Figure 1
Figure 1: Visual comparison before and after applying HP-Edit based on the pretrained Qwen-Image-Edit-2509, across eight common editing tasks. [PITH_FULL_IMAGE:figures/full_fig_p001_1.png]
Figure 2
Figure 2: The overview of the proposed framework, HP-Edit, which consists of three stages: the task-aware HP-Scorer for human preference [PITH_FULL_IMAGE:figures/full_fig_p003_2.png]
Figure 3
Figure 3: The details of task and object distribution in RealPref. [PITH_FULL_IMAGE:figures/full_fig_p005_3.png]
Figure 4
Figure 4: Qualitative comparison on RealPref-Bench across eight common editing tasks. [PITH_FULL_IMAGE:figures/full_fig_p006_4.png]
Figure 5
Figure 5: Reward curves of HP-Edit with different settings. [PITH_FULL_IMAGE:figures/full_fig_p007_5.png]
Figure 6
Figure 6: HP-Score and user score. [PITH_FULL_IMAGE:figures/full_fig_p008_6.png]
read the original abstract

Common image editing tasks typically adopt powerful generative diffusion models as the leading paradigm for real-world content editing. Meanwhile, although reinforcement learning (RL) methods such as Diffusion-DPO and Flow-GRPO have further improved generation quality, efficiently applying Reinforcement Learning from Human Feedback (RLHF) to diffusion-based editing remains largely unexplored, due to a lack of scalable human-preference datasets and frameworks tailored to diverse editing needs. To fill this gap, we propose HP-Edit, a post-training framework for Human Preference-aligned Editing, and introduce RealPref-50K, a real-world dataset across eight common tasks and balancing common object editing. Specifically, HP-Edit leverages a small amount of human-preference scoring data and a pretrained visual large language model (VLM) to develop HP-Scorer--an automatic, human preference-aligned evaluator. We then use HP-Scorer both to efficiently build a scalable preference dataset and to serve as the reward function for post-training the editing model. We also introduce RealPref-Bench, a benchmark for evaluating real-world editing performance. Extensive experiments demonstrate that our approach significantly enhances models such as Qwen-Image-Edit-2509, aligning their outputs more closely with human preference.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes HP-Edit, a post-training framework for human-preference-aligned image editing. It develops HP-Scorer from a small human-preference seed set plus a pretrained VLM, uses this scorer to automatically label the RealPref-50K dataset across eight editing tasks, and employs the same scorer as the reward signal for RL post-training (e.g., on Qwen-Image-Edit-2509). A new RealPref-Bench is introduced for evaluation, with the central claim that the approach yields outputs significantly better aligned with human preferences than the base model.

Significance. If the HP-Scorer is shown to be accurate and unbiased, the framework would provide a practical route to scalable RLHF for diffusion-based editing models, addressing the noted scarcity of preference datasets and tailored training methods in this domain.

major comments (3)
  1. [Abstract and Experiments section] The abstract states that 'extensive experiments demonstrate significant enhancement' but supplies no quantitative metrics, baselines, ablation results, or scorer validation statistics (e.g., Pearson/Spearman correlation with held-out human ratings, inter-task consistency, or bias analysis). Because HP-Scorer labels the entire 50K dataset and serves as the RL reward, this omission leaves the central empirical claim without visible supporting evidence.
  2. [HP-Scorer development and RealPref-50K construction] HP-Scorer is trained on limited human seed data and then used both to construct RealPref-50K labels and as the reward function for post-training. No independent validation (cross-validation on held-out human judgments, error analysis per editing task, or comparison against direct human scoring) is described; any systematic bias in the scorer would be amplified in the preference dataset and directly shape the policy gradient, undermining the claim of genuine human-preference alignment.
  3. [Experiments and RealPref-Bench evaluation] The evaluation on RealPref-Bench reports improvements for Qwen-Image-Edit-2509 but does not include standard RLHF baselines (e.g., Diffusion-DPO or Flow-GRPO applied without the HP-Edit pipeline) or ablations that isolate the contribution of the scorer-derived reward versus the dataset alone. This makes it impossible to attribute gains specifically to the proposed framework.
minor comments (2)
  1. [Method] Clarify the exact size and composition of the initial human seed set used to train HP-Scorer, including how many ratings per editing task.
  2. [Post-training details] Provide the precise RL objective and hyper-parameters used when the HP-Scorer serves as reward (e.g., PPO or GRPO variant, clipping values).
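The scorer validation requested in major comment 1 is straightforward to run once held-out human ratings exist. A sketch (Spearman computed as Pearson on ranks; a production analysis would use `scipy.stats.spearmanr`, which also handles ties properly):

```python
import numpy as np

def pearson(x, y):
    """Pearson correlation between scorer outputs x and human ratings y."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    return float(np.corrcoef(x, y)[0, 1])

def spearman(x, y):
    # Rank transform via double argsort (ties get arbitrary order here;
    # scipy.stats.spearmanr assigns average ranks instead).
    rank = lambda v: np.argsort(np.argsort(np.asarray(v, float))).astype(float)
    return pearson(rank(x), rank(y))
```

Reporting both statistics per editing task, on ratings never seen during HP-Scorer training, is exactly the evidence the report asks for.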

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment point by point below, outlining the revisions that will be incorporated into the next version of the manuscript to strengthen the empirical support and clarity of our claims.

read point-by-point responses
  1. Referee: [Abstract and Experiments section] The abstract states that 'extensive experiments demonstrate significant enhancement' but supplies no quantitative metrics, baselines, ablation results, or scorer validation statistics (e.g., Pearson/Spearman correlation with held-out human ratings, inter-task consistency, or bias analysis). Because HP-Scorer labels the entire 50K dataset and serves as the RL reward, this omission leaves the central empirical claim without visible supporting evidence.

    Authors: We agree that the abstract would benefit from greater specificity to make the central claims immediately evident. The Experiments section already contains quantitative results on RealPref-Bench (including preference alignment metrics and model comparisons), but these are not summarized in the abstract. We will revise the abstract to include key quantitative findings such as win rates on human preference judgments and overall improvement scores. We will also add a dedicated subsection on HP-Scorer validation that reports Pearson and Spearman correlations with held-out human ratings, inter-task consistency, and bias analysis. revision: yes

  2. Referee: [HP-Scorer development and RealPref-50K construction] HP-Scorer is trained on limited human seed data and then used both to construct RealPref-50K labels and as the reward function for post-training. No independent validation (cross-validation on held-out human judgments, error analysis per editing task, or comparison against direct human scoring) is described; any systematic bias in the scorer would be amplified in the preference dataset and directly shape the policy gradient, undermining the claim of genuine human-preference alignment.

    Authors: We acknowledge the importance of rigorous independent validation for the HP-Scorer given its central role. The current manuscript describes the training procedure but does not include the requested validation details. In the revision we will add a new validation subsection that reports cross-validation results on held-out human judgments, per-task error analysis across the eight editing tasks, and direct comparisons of HP-Scorer outputs against additional human scoring. Any detected biases will be quantified and discussed. revision: yes

  3. Referee: [Experiments and RealPref-Bench evaluation] The evaluation on RealPref-Bench reports improvements for Qwen-Image-Edit-2509 but does not include standard RLHF baselines (e.g., Diffusion-DPO or Flow-GRPO applied without the HP-Edit pipeline) or ablations that isolate the contribution of the scorer-derived reward versus the dataset alone. This makes it impossible to attribute gains specifically to the proposed framework.

    Authors: We agree that additional baselines and ablations are necessary to isolate the contribution of the HP-Edit framework. The current evaluation focuses on the end-to-end improvement but does not include the suggested comparisons. We will expand the Experiments section to include results from Diffusion-DPO and Flow-GRPO applied directly to the base model (without the HP-Edit pipeline) as well as ablations that separately evaluate the scorer-derived reward signal versus training on RealPref-50K alone. revision: yes

Circularity Check

0 steps flagged

No circularity: derivation uses external human seed data plus pretrained VLM to scale labels and reward, with claims resting on independent benchmark experiments

full rationale

The paper constructs HP-Scorer from a small external human-preference dataset plus a pretrained VLM, then applies the scorer to label RealPref-50K and to supply the RL reward signal. This is a standard semi-supervised scaling step rather than a self-definitional loop or fitted-input prediction. The central claim of improved human alignment is supported by experiments on the separately introduced RealPref-Bench, which is not shown to be constructed from the same scorer outputs in a way that forces the result. No equations, self-citations, or uniqueness theorems are invoked that reduce the final performance gain to the input data by construction. Potential bias propagation is a correctness risk, not a circularity violation.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the unstated assumption that the VLM can be fine-tuned into a reliable human-preference proxy using only a small seed set, plus standard RL assumptions about reward modeling in diffusion editing; no explicit free parameters or invented entities are named in the abstract.

axioms (2)
  • domain assumption A pretrained VLM can be adapted with limited human ratings to serve as an accurate proxy for human editing preferences across eight tasks.
    Invoked in the description of HP-Scorer development.
  • domain assumption Using the scorer to label a larger dataset and as RL reward will produce models that generalize to real-world human preferences.
    Central to the post-training pipeline.

pith-pipeline@v0.9.0 · 5552 in / 1534 out tokens · 36860 ms · 2026-05-10T03:21:17.252972+00:00 · methodology

discussion (0)


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Power Reinforcement Post-Training of Text-to-Image Models with Super-Linear Advantage Shaping

    cs.CV 2026-05 unverdicted novelty 6.0

    Super-Linear Advantage Shaping (SLAS) introduces a non-linear geometric policy update for RL post-training of text-to-image models that reshapes the local policy space via advantage-dependent Fisher-Rao weighting to r...

Reference graph

Works this paper leans on

102 extracted references · 29 canonical work pages · cited by 1 Pith paper · 19 internal anchors

  1. [1]

    FLUX. https://github.com/black-forest-labs/flux.

  2. [2]

    Stable Diffusion. https://github.com/Stability-AI/StableDiffusion.

  3. [3]

    Pixabay. https://pixabay.com.

  4. [4]

    Ntire 2017 challenge on single image super-resolution: Dataset and study

    Eirikur Agustsson and Radu Timofte. NTIRE 2017 challenge on single image super-resolution: Dataset and study. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 126–135, 2017.

  5. [5]

    Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-VL technical report. arXiv preprint arXiv:2502.13923, 2025.

  6. [6]

    Training Diffusion Models with Reinforcement Learning

    Kevin Black, Michael Janner, Yilun Du, Ilya Kostrikov, and Sergey Levine. Training diffusion models with reinforcement learning. arXiv preprint arXiv:2305.13301, 2023.

  7. [7]

    InstructPix2Pix: Learning to Follow Image Editing Instructions

    Tim Brooks, Aleksander Holynski, and Alexei A Efros. InstructPix2Pix: Learning to follow image editing instructions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18392–18402, 2023.

  8. [8]

    Emerging Properties in Unified Multimodal Pretraining

    Chaorui Deng, Deyao Zhu, Kunchang Li, Chenhui Gou, Feng Li, Zeyu Wang, Shu Zhong, Weihao Yu, Xiaonan Nie, Ziang Song, et al. Emerging properties in unified multimodal pretraining. arXiv preprint arXiv:2505.14683, 2025.

  9. [9]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025.

  10. [10]

    ACE: All-round Creator and Editor Following Instructions via Diffusion Transformer

    Zhen Han, Zeyinzi Jiang, Yulin Pan, Jingfeng Zhang, Chaojie Mao, Chenwei Xie, Yu Liu, and Jingren Zhou. ACE: All-round creator and editor following instructions via diffusion transformer. arXiv preprint arXiv:2410.00086, 2024.

  11. [11]

    TempFlow-GRPO: When Timing Matters for GRPO in Flow Models

    Xiaoxuan He, Siming Fu, Yuke Zhao, Wanli Li, Jian Yang, Dacheng Yin, Fengyun Rao, and Bo Zhang. TempFlow-GRPO: When timing matters for GRPO in flow models. arXiv preprint arXiv:2508.04324, 2025.

  12. [12]

    Denoising Diffusion Probabilistic Models

    Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840–6851, 2020.

  13. [13]

    D-Fusion: Direct Preference Optimization for Aligning Diffusion Models with Visually Consistent Samples

    Zijing Hu, Fengda Zhang, and Kun Kuang. D-Fusion: Direct preference optimization for aligning diffusion models with visually consistent samples. arXiv preprint arXiv:2505.22002, 2025.

  14. [14]

    SmartEdit: Exploring Complex Instruction-Based Image Editing with Multimodal Large Language Models

    Yuzhou Huang, Liangbin Xie, Xintao Wang, Ziyang Yuan, Xiaodong Cun, Yixiao Ge, Jiantao Zhou, Chao Dong, Rui Huang, Ruimao Zhang, et al. SmartEdit: Exploring complex instruction-based image editing with multimodal large language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8362–8371, 2024.

  15. [15]

    OpenAI o1 System Card

    Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, et al. OpenAI o1 system card. arXiv preprint arXiv:2412.16720, 2024.

  16. [16]

    Dual prompting image restoration with diffusion transformers

    Dehong Kong, Fan Li, Zhixin Wang, Jiaqi Xu, Renjing Pei, Wenbo Li, and WenQi Ren. Dual prompting image restoration with diffusion transformers. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 12809–12819, 2025.

  17. [17]

    Black Forest Labs, Stephen Batifol, Andreas Blattmann, Frederic Boesel, Saksham Consul, Cyril Diagne, Tim Dockhorn, Jack English, Zion English, Patrick Esser, et al. FLUX.1 Kontext: Flow matching for in-context image generation and editing in latent space. arXiv preprint arXiv:2506.15742, 2025.

  18. [18]

    Magiceraser: Erasing any objects via semantics-aware control

    Fan Li, Zixiao Zhang, Yi Huang, Jianzhuang Liu, Renjing Pei, Bin Shao, and Songcen Xu. MagicEraser: Erasing any objects via semantics-aware control. In European Conference on Computer Vision, pages 215–231. Springer, 2024.

  19. [19]

    Lsdir: A large scale dataset for image restoration

    Yawei Li, Kai Zhang, Jingyun Liang, Jiezhang Cao, Ce Liu, Rui Gong, Yulun Zhang, Hao Tang, Yun Liu, Denis Demandolx, et al. LSDIR: A large scale dataset for image restoration. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1775–1787, 2023.

  20. [20]

    Brushedit: All-in-one image inpainting and editing

    Yaowei Li, Yuxuan Bian, Xuan Ju, Zhaoyang Zhang, Junhao Zhuang, Ying Shan, Yuexian Zou, and Qiang Xu. BrushEdit: All-in-one image inpainting and editing. arXiv preprint arXiv:2412.10316, 2024.

  21. [21]

    UniWorld-V1: High-Resolution Semantic Encoders for Unified Visual Understanding and Generation

    Bin Lin, Zongjian Li, Xinhua Cheng, Yuwei Niu, Yang Ye, Xianyi He, Shenghai Yuan, Wangbo Yu, Shaodong Wang, Yunyang Ge, et al. UniWorld: High-resolution semantic encoders for unified visual understanding and generation. arXiv preprint arXiv:2506.03147, 2025.

  22. [22]

    Microsoft coco: Common objects in context

    Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft COCO: Common objects in context. In European Conference on Computer Vision, pages 740–755. Springer, 2014.

  23. [23]

    Flow Matching for Generative Modeling

    Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling. arXiv preprint arXiv:2210.02747, 2022.

  24. [24]

    Flow-GRPO: Training Flow Matching Models via Online RL

    Jie Liu, Gongye Liu, Jiajun Liang, Yangguang Li, Jiaheng Liu, Xintao Wang, Pengfei Wan, Di Zhang, and Wanli Ouyang. Flow-GRPO: Training flow matching models via online RL. arXiv preprint arXiv:2505.05470, 2025.

  25. [25]

    VideoDPO: Omni-Preference Alignment for Video Diffusion Generation

    Runtao Liu, Haoyu Wu, Ziqiang Zheng, Chen Wei, Yingqing He, Renjie Pi, and Qifeng Chen. VideoDPO: Omni-preference alignment for video diffusion generation. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 8009–8019, 2025.

  26. [26]

    Step1X-Edit: A Practical Framework for General Image Editing

    Shiyu Liu, Yucheng Han, Peng Xing, Fukun Yin, Rui Wang, Wei Cheng, Jiaqi Liao, Yingming Wang, Honghao Fu, Chunrui Han, et al. Step1X-Edit: A practical framework for general image editing. arXiv preprint arXiv:2504.17761, 2025.

  27. [27]

    Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow

    Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow. arXiv preprint arXiv:2209.03003, 2022.

  28. [28]

    MIA-DPO: Multi-Image Augmented Direct Preference Optimization for Large Vision-Language Models

    Ziyu Liu, Yuhang Zang, Xiaoyi Dong, Pan Zhang, Yuhang Cao, Haodong Duan, Conghui He, Yuanjun Xiong, Dahua Lin, and Jiaqi Wang. MIA-DPO: Multi-image augmented direct preference optimization for large vision-language models. arXiv preprint arXiv:2410.17637, 2024.

  29. [29]

    Decoupled Weight Decay Regularization

    Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In International Conference on Learning Representations, 2019.

  30. [30]

    Sample by Step, Optimize by Chunk: Chunk-Level GRPO for Text-to-Image Generation

    Yifu Luo, Penghui Du, Bo Li, Sinan Du, Tiantian Zhang, Yongzhe Chang, Kai Wu, Kun Gai, and Xueqian Wang. Sample by step, optimize by chunk: Chunk-level GRPO for text-to-image generation. arXiv preprint arXiv:2510.21583, 2025.

  31. [31]

    X2Edit: Revisiting Arbitrary-Instruction Image Editing through Self-Constructed Data and Task-Aware Representation Learning

    Jian Ma, Xujie Zhu, Zihao Pan, Qirong Peng, Xu Guo, Chen Chen, and Haonan Lu. X2Edit: Revisiting arbitrary-instruction image editing through self-constructed data and task-aware representation learning. ICCV, 2025.

  32. [32]

    ACE++: Instruction-Based Image Creation and Editing via Context-Aware Content Filling

    Chaojie Mao, Jingfeng Zhang, Yulin Pan, Zeyinzi Jiang, Zhen Han, Yu Liu, and Jingren Zhou. ACE++: Instruction-based image creation and editing via context-aware content filling. arXiv preprint arXiv:2501.02487, 2025.

  33. [33]

    Improved denoising diffusion probabilistic models

    Alexander Quinn Nichol and Prafulla Dhariwal. Improved denoising diffusion probabilistic models. In International Conference on Machine Learning, pages 8162–8171. PMLR, 2021.

  34. [34]

    Camedit: Continuous camera parameter control for photorealistic image editing

    Xinran Qin, Zhixin Wang, Fan Li, Haoyu Chen, Renjing Pei, Wenbo Li, and Xiaochun Cao. CamEdit: Continuous camera parameter control for photorealistic image editing. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025.

  35. [35]

    Direct Preference Optimization: Your Language Model Is Secretly a Reward Model

    Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems, 36:53728–53741, 2023.

  36. [36]

    High-Resolution Image Synthesis with Latent Diffusion Models

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10674–10685, 2022.

  37. [37]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024.

  38. [38]

    Seededit: Align image re-generation to image editing

    Yichun Shi, Peng Wang, and Weilin Huang. SeedEdit: Align image re-generation to image editing. arXiv preprint arXiv:2411.06686, 2024.

  39. [39]

    Deep unsupervised learning using nonequilibrium thermodynamics

    Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In International Conference on Machine Learning, pages 2256–2265. PMLR, 2015.

  40. [40]

    Denoising Diffusion Implicit Models

    Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502, 2020.

  41. [41]

    Score-Based Generative Modeling through Stochastic Differential Equations

    Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456, 2020.

  42. [42]

    PocketSR: The Super-Resolution Expert in Your Pocket Mobiles

    Haoze Sun, Linfeng Jiang, Fan Li, Renjing Pei, Zhixin Wang, Yong Guo, Jiaqi Xu, Haoyu Chen, Jin Han, Fenglong Song, et al. PocketSR: The super-resolution expert in your pocket mobiles. NeurIPS, 2025.

  43. [43]

    Reinforcement Learning: An Introduction

    Richard S Sutton, Andrew G Barto, et al. Reinforcement Learning: An Introduction. MIT Press, Cambridge, 1998.

  44. [44]

    BalancedDPO: Adaptive Multi-Metric Alignment

    Dipesh Tamboli, Souradip Chakraborty, Aditya Malusare, Biplab Banerjee, Amrit Singh Bedi, and Vaneet Aggarwal. BalancedDPO: Adaptive multi-metric alignment. arXiv preprint arXiv:2503.12575, 2025.

  45. [45]

    Diffusion Model Alignment Using Direct Preference Optimization

    Bram Wallace, Meihua Dang, Rafael Rafailov, Linqi Zhou, Aaron Lou, Senthil Purushwalkam, Stefano Ermon, Caiming Xiong, Shafiq Joty, and Nikhil Naik. Diffusion model alignment using direct preference optimization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8228–8238, 2024.

  46. [46]

    Pref-GRPO: Pairwise Preference Reward-based GRPO for Stable Text-to-Image Reinforcement Learning

    Yibin Wang, Zhimin Li, Yuhang Zang, Yujie Zhou, Jiazi Bu, Chunyu Wang, Qinglin Lu, Cheng Jin, and Jiaqi Wang. Pref-GRPO: Pairwise preference reward-based GRPO for stable text-to-image reinforcement learning. arXiv preprint arXiv:2508.20751, 2025.

  47. [47]

    Ace: Anti-editing concept erasure in text-to-image models

    Zihao Wang, Yuxiang Wei, Fan Li, Renjing Pei, Hang Xu, and Wangmeng Zuo. ACE: Anti-editing concept erasure in text-to-image models. 2025.

  48. [48]

    VTinker: Guided Flow Upsampling and Texture Mapping for High-Resolution Video Frame Interpolation

    Chenyang Wu, Jiayi Fu, Chun-Le Guo, Shuhao Han, and Chongyi Li. VTinker: Guided flow upsampling and texture mapping for high-resolution video frame interpolation. arXiv preprint arXiv:2511.16124, 2025.

  49. [49]

    Qwen-Image Technical Report

    Chenfei Wu, Jiahao Li, Jingren Zhou, Junyang Lin, Kaiyuan Gao, Kun Yan, Sheng-ming Yin, Shuai Bai, Xiao Xu, Yilei Chen, et al. Qwen-Image technical report. arXiv preprint arXiv:2508.02324, 2025.

  50. [50]

    OmniGen2: Towards Instruction-Aligned Multimodal Generation

    Chenyuan Wu, Pengfei Zheng, Ruiran Yan, Shitao Xiao, Xin Luo, Yueze Wang, Wanli Li, Xiyan Jiang, Yexin Liu, Junjie Zhou, et al. OmniGen2: Exploration to advanced multimodal generation. arXiv preprint arXiv:2506.18871, 2025.

  51. [51]

    DenseDPO: Fine-Grained Temporal Preference Optimization for Video Diffusion Models

    Ziyi Wu, Anil Kag, Ivan Skorokhodov, Willi Menapace, Ashkan Mirzaei, Igor Gilitschenski, Sergey Tulyakov, and Aliaksandr Siarohin. DenseDPO: Fine-grained temporal preference optimization for video diffusion models. arXiv preprint arXiv:2506.03517, 2025.

  52. [52]

    OmniGen: Unified Image Generation

    Shitao Xiao, Yueze Wang, Junjie Zhou, Huaying Yuan, Xingrun Xing, Ruiran Yan, Chaofan Li, Shuting Wang, Tiejun Huang, and Zheng Liu. OmniGen: Unified image generation. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 13294–13304, 2025.

  53. [53]

    DanceGRPO: Unleashing GRPO on Visual Generation

    Zeyue Xue, Jie Wu, Yu Gao, Fangyuan Kong, Lingting Zhu, Mengzhao Chen, Zhiheng Liu, Wei Liu, Qiushan Guo, Weilin Huang, et al. DanceGRPO: Unleashing GRPO on visual generation. arXiv preprint arXiv:2505.07818, 2025.

  54. [54]

    Anyedit: Mastering unified high-quality image editing for any idea

    Qifan Yu, Wei Chow, Zhongqi Yue, Kaihang Pan, Yang Wu, Xiaoyang Wan, Juncheng Li, Siliang Tang, Hanwang Zhang, and Yueting Zhuang. AnyEdit: Mastering unified high-quality image editing for any idea. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 26125–26135, 2025.

  55. [55]

    MagicBrush: A Manually Annotated Dataset for Instruction-Guided Image Editing

    Kai Zhang, Lingbo Mo, Wenhu Chen, Huan Sun, and Yu Su. MagicBrush: A manually annotated dataset for instruction-guided image editing. Advances in Neural Information Processing Systems, 36:31428–31449, 2023.

  56. [56]

    UltraEdit: Instruction-Based Fine-Grained Image Editing at Scale

    Haozhe Zhao, Xiaojian Shawn Ma, Liang Chen, Shuzheng Si, Rujie Wu, Kaikai An, Peiyu Yu, Minjia Zhang, Qing Li, and Baobao Chang. UltraEdit: Instruction-based fine-grained image editing at scale. Advances in Neural Information Processing Systems, 37:3058–3093, 2024.

  57. [57]

    Huaisheng Zhu, Teng Xiao, and Vasant G Honavar. Dspo: Direct score preference optimization for diffusion model alignment. In The Thirteenth International Conference on Learning Representations, 2025.

HP-Edit: A Human-Preference Post-Training Framework for Image Editing (Supplementary Material)

Section S1 provides more details of experiments of the main pap...

Evaluation prompt questions used by the HP-Scorer, grouped by editing task (trailing "..." marks text truncated in the source):

Object removal:

    • Does Image A contain a clearly identifiable subject or main object?

    • Does the object mentioned in the instruction appear in Image A?

    • Has the object been successfully removed in Image B?

    • Does Image B look visually natural and realistic, without artifacts or corrupted regions? In particular, does the region where the object was removed avoid unnatural blur or unnatural shadows? You need to rate the editing result from 0 to 5 based on the accuracy and quality of the edit. Scoring Guidelines: • 0: The edited result is completely incorrect, d...

Object addition:

    • Is Image A of high quality (clear, undistorted, and visually usable)?

    • Has the target object been successfully added in Image B?

    • Are Image A and Image B meaningfully different (not nearly identical)?

    • Does Image B look visually natural and realistic, without obvious artifacts, corrupted regions, unnatural blur, or unnatural shadows in the region where the object was added?

    • Do the objects added in Image B follow the given editing instruction accurately (in terms of category, attributes, position, and other specified details)? You need to rate the editing result from 0 to 5 based on the accuracy and quality of the edit. Scoring Guidelines: • 0: The edited result is completely incorrect, does not follow the Editing Instruction...

Object replacement:

    • Does the original Image A contain a clearly identifiable person or object that is required to be replaced according to the editing instruction?

    • Does the object replacement (swapping) operation described in the instruction satisfy both logical feasibility and a clear, unambiguous description?

    • Comparing Image B with Image A, has the original object that needs to be replaced in A completely disappeared in B?

    • Is the replacement object in Image B clear and complete, without missing parts or distorted local shapes?

    • Does the replacement object in Image B meet the description requirements specified in the instruction (category, attributes, pose, position, etc.)?

    • Are there no extra objects in Image B that are not required by the editing instruction?

    • Does Image B completely retain the background information of Image A, without background loss, distortion, or damage?

    • Does Image B completely retain the parts of the original image that were not mentioned in the editing instruction?

    • Does Image B look realistic and consistent with physical and real-world logic (no unsupported floating objects, no object penetration, no obvious compositing artifacts)? You need to rate the editing result from 0 to 5 based on the accuracy and quality of the edit. Scoring guidelines: • 0: The edited result is completely incorrect, does not follow the edit...

Background replacement:

    • Does Image A contain a clearly identifiable foreground subject (such as a person or an object)?

    • Does the editing instruction describe a valid background replacement operation?

    • Has the background in Image B changed compared to Image A, in accordance with the instruction?

    • Is the foreground subject preserved correctly in Image B (not missing, distorted, or corrupted)?

    • Does Image B look visually natural and realistic, without visible artifacts or unnatural blending? You need to rate the editing result from 0 to 5 based on the accuracy and quality of the edit. Scoring guidelines: • 0: The edited result is completely incorrect, does not follow the editing instruction at all, or fails to meet any of the requirements. • 1: ...
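The prompt fragments above follow a common pattern: per-task yes/no pre-checks followed by a shared 0-to-5 rating rubric. A minimal sketch of how such a checklist could be assembled into a single evaluator prompt is shown below. This is not the paper's code: the function and variable names are hypothetical, and the rubric tail (scores above 1) is paraphrased, since the source text is truncated.

```python
# Illustrative sketch: building an HP-Scorer-style evaluation prompt from
# per-task checklist questions plus a shared 0-5 rubric. Names are hypothetical.

# Pre-check questions for the object-removal task, quoted from the paper's prompts.
REMOVAL_CHECKS = [
    "Does Image A contain a clearly identifiable subject or main object?",
    "Does the object mentioned in the instruction appear in Image A?",
    "Has the object been successfully removed in Image B?",
]

# Rubric opening quoted from the source; the description of higher scores is
# an assumption, as the source text is truncated after score 0.
RUBRIC = (
    "You need to rate the editing result from 0 to 5 based on the accuracy "
    "and quality of the edit. 0: the edited result is completely incorrect "
    "or does not follow the Editing Instruction at all."
)

def build_scorer_prompt(task_checks: list[str], instruction: str) -> str:
    """Join an editing instruction, pre-check questions, and the rating
    rubric into one text prompt for a VLM-based evaluator."""
    numbered = "\n".join(f"{i + 1}. {q}" for i, q in enumerate(task_checks))
    return (
        f"Editing Instruction: {instruction}\n\n"
        f"Checks:\n{numbered}\n\n"
        f"{RUBRIC}"
    )

prompt = build_scorer_prompt(REMOVAL_CHECKS, "Remove the red car from the street.")
print(prompt)
```

In practice the same template would be instantiated once per editing task (removal, addition, replacement, background replacement), swapping in that task's checklist while the rubric stays fixed.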
