arxiv: 2606.03216 · v1 · pith:RFLLXZI3new · submitted 2026-06-02 · 💻 cs.CV

Follow-Your-Preference++: Rethinking Preference Alignment for Image Inpainting

Pith reviewed 2026-06-28 10:49 UTC · model grok-4.3

classification 💻 cs.CV
keywords image inpainting · preference alignment · direct preference optimization · reward models · ensemble methods · generative models · computer vision · object removal

The pith

An ensemble of nine public reward models lets direct preference optimization improve inpainting models without architecture changes or new data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines preference alignment for image inpainting by applying the direct preference optimization framework to pairs built from existing reward models. It tests nine different reward models across two inpainting baselines and two benchmarks, revealing that most supply useful training signals yet introduce consistent biases in brightness, composition, and color. A simple ensemble of these models reduces the biases and produces aligned models that exceed prior state-of-the-art results on automatic metrics, vision-language model judgments, and human ratings. The same aligned models also transfer to the related task of object removal, where the goal is coherent background completion rather than open creative filling.

Core claim

Preference data constructed from an ensemble of reward models, when used inside direct preference optimization, yields inpainting models that outperform prior state-of-the-art approaches on standard metrics, large vision-language model evaluations, and human assessments, while the same models transfer successfully to object removal; the gains arise without any architectural modification or additional training datasets.
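
For orientation, the loss that direct preference optimization minimizes for a single preferred/rejected pair can be sketched as follows. This is the generic log-likelihood form from the DPO literature, not the paper's training code; for diffusion or flow-based inpainting models the log-likelihood terms are typically replaced by denoising-error surrogates. The function name `dpo_pair_loss` and the default `beta` are illustrative assumptions.

```python
import math


def dpo_pair_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Generic DPO loss for one (preferred, rejected) pair.

    logp_* are log-likelihoods under the model being trained, ref_logp_* under
    the frozen reference model. beta=0.1 is an illustrative default only.
    """
    # Implicit reward margin: how much more the trained model favours the
    # preferred sample over the rejected one, relative to the reference.
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log(sigmoid(margin)), written in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

When the trained model equals the reference model the margin is zero and the loss is log 2; the loss falls below that value only when the trained model shifts probability toward the preferred sample.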

What carries the argument

An ensemble of reward models that constructs preference pairs for direct preference optimization training.
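
One way such an ensemble could turn candidate outputs into a preference pair is sketched below. The summary above does not say how the paper aggregates the reward models, so the rule used here (standardize each model's scores across candidates, average them, then take the top and bottom candidates) is an assumption, and the model names and scores are hypothetical.

```python
from statistics import mean, pstdev


def zscores(values):
    """Standardize one reward model's scores across the candidates."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def ensemble_pair(scores_by_model):
    """Pick (preferred, rejected) candidate indices for one prompt.

    scores_by_model maps a reward-model name to a list of scores, one per
    candidate inpainting result. Averaging z-scores is an assumed aggregation
    rule, chosen so that no model dominates because of its score scale.
    """
    normalised = [zscores(s) for s in scores_by_model.values()]
    n_candidates = len(normalised[0])
    combined = [mean(m[i] for m in normalised) for i in range(n_candidates)]
    best = max(range(n_candidates), key=combined.__getitem__)
    worst = min(range(n_candidates), key=combined.__getitem__)
    return best, worst


# Hypothetical scores for four candidates from three reward models.
scores = {
    "model_a": [0.21, 0.25, 0.19, 0.23],
    "model_b": [20.1, 22.4, 19.0, 21.0],
    "model_c": [5.2, 5.9, 5.0, 6.1],
}
pair = ensemble_pair(scores)  # (index of preferred, index of rejected)
```

In this toy example `model_c` alone would prefer the last candidate, but the averaged score still selects the candidate that the other two models rank first, which illustrates how an ensemble can damp a single model's idiosyncratic preference.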

If this is right

  • Preference alignment improves inpainting quality across different model architectures and generative mechanisms.
  • An ensemble of reward models produces more robust and generalizable performance than any single reward model.
  • The aligned models transfer to object removal, shifting the task from open-ended generation to coherent background completion.
  • Consistent scaling trends appear when the number of candidate pairs or samples is increased.
  • A calibrated ensemble variant further reduces reward hacking and increases robustness.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Reward-model biases observed here may appear in other generative tasks that rely on the same public reward models.
  • The transfer result suggests the alignment procedure could be tested on additional image-editing operations such as outpainting or style transfer.
  • Future work could replace the public reward models with ones trained specifically on inpainting human preferences to test whether further gains are possible.
  • The finding that most reward models remain useful as data sources even when unreliable as evaluators may apply to preference tuning in other domains.

Load-bearing premise

The preference pairs built from the nine reward models reflect genuine human preferences for inpainting quality rather than the reward models' own biases toward brightness, composition, and color.
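
A rough way to probe this premise for the brightness bias is to check how often the preferred image in a pair is simply the brighter one. The sketch below is an illustration, not a procedure taken from the paper: images are represented as plain lists of RGB tuples, brightness is measured with the Rec. 601 luma weights, and the function names are hypothetical.

```python
def mean_luma(pixels):
    """Mean Rec. 601 luma of an image given as (r, g, b) tuples in [0, 255]."""
    return sum(0.299 * r + 0.587 * g + 0.114 * b for r, g, b in pixels) / len(pixels)


def brighter_chosen_rate(pairs):
    """Fraction of (chosen, rejected) pairs whose chosen image is brighter.

    A rate far from 0.5 hints that the pair-construction rule favours
    brighter (or darker) outputs; it is a rough probe, not a formal test.
    """
    brighter = sum(mean_luma(c) > mean_luma(r) for c, r in pairs)
    return brighter / len(pairs)


# Toy 2-pixel "images" standing in for real inpainting outputs.
bright = [(200, 200, 200), (220, 220, 220)]
dark = [(40, 40, 40), (60, 60, 60)]
rate = brighter_chosen_rate([(bright, dark), (bright, dark), (dark, bright)])
```

Comparing this rate between pairs labelled by a single reward model and pairs labelled by human raters on the same candidates would indicate whether the preference signal tracks brightness more than people do.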

What would settle it

A controlled human study in which raters consistently prefer the outputs of the original unaligned inpainting models over the preference-aligned versions in side-by-side blind comparisons would falsify the performance claim.
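
The outcome of such a side-by-side study can be summarized with win rates and a simple significance check. The sketch below assumes votes are recorded as "A", "B", or "tie" and applies an exact two-sided sign test to the non-tied votes; the vote counts are hypothetical, and the test treats votes as independent, which repeated ratings by the same volunteers would violate.

```python
from collections import Counter
from math import comb


def win_rates(votes):
    """Share of 'A', 'B' and 'tie' votes in a blind pairwise study."""
    counts = Counter(votes)
    total = len(votes)
    return {k: counts.get(k, 0) / total for k in ("A", "B", "tie")}


def sign_test_p(wins_a, wins_b):
    """Two-sided exact sign test on non-tied votes (null: A and B equally likely)."""
    n = wins_a + wins_b
    k = min(wins_a, wins_b)
    # Probability of a split at least as lopsided as the observed one, one tail.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical votes: A = aligned model, B = unaligned baseline.
votes = ["A"] * 14 + ["B"] * 4 + ["tie"] * 2
rates = win_rates(votes)
p_value = sign_test_p(14, 4)
```

A result that would count against the paper's claim is the mirror image of this example: a clearly higher win rate for the unaligned baseline with a small p-value.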

Figures

Figures reproduced from arXiv: 2606.03216 by Hideki Nakayama, Junkun Yuan, Toru Aonishi, Yue Ma, Yutao Shen.

Figure 1. Results of our model. The inpainting outputs generated by our model are visually coherent, semantically aligned with text prompts, and consistent with human aesthetic preferences.
Figure 2. Candidate scaling (a-c) and sample scaling (d-f) using HPSv2, PickScore, and Ensemble. We employ Aesthetic, ImageR, HPSv2, PickScore, and GPT-4 for evaluation. The first and second rows of each sub-figure are based on BrushNet and FLUX.1 Fill, respectively. We use training steps to indicate the consumed samples to align the scaling across models (their batch sizes are different).
Figure 3. Bias studies. In each sub-figure, the four images (from left to right) display: the masked image, followed by inpainting results from models trained using HPSv2, PickScore, and Ensemble. We omit text prompts for brevity. Zoom in to see details. Find more examples in the Appendix. (a) Examples from models trained using BrushNet. (b) Examples from models trained using FLUX.1 Fill.
Figure 4. Qualitative results of ablations. In each sub-figure, the three images (from left to right) display: the masked image, followed by inpainting results from baseline models and baseline models + preference alignment using Ensemble. We omit text prompts for brevity. Zoom in to see details.
Figure 5. Qualitative comparisons with state-of-the-art image inpainting models.
Figure 6. User studies. We compare each pair of models by randomly sampling 100 pairs from their inpainting results. We invite 30 volunteers to participate in a blind assessment to determine which one is better (“A win”, “B win”, or “tie”) based on their preferences. We report the winning rates.
Figure 7. Searches of β, i.e., sub-figure (a), and learning rate (lr), i.e., sub-figure (b). The first and second rows of each sub-figure are based on BrushNet and FLUX.1 Fill, respectively.
Figure 8. Different Ensemble variants. In each sub-figure, the four images (from left to right) display: the masked image, followed by inpainting results from models trained using Ensemble (vanilla), Ensemble (w/o VQAScore), and Ensemble (Linear). We omit text prompts for brevity. Zoom in to see details.
Figure 9. Qualitative results. In each sub-figure, the four images (from left to right) display: the masked image, followed by inpainting results from models trained using Ensemble (vanilla), CaPO, and CaEN. We omit text prompts for brevity. Zoom in to see details. Find more examples in the Appendix.
Figure 10. Qualitative results of ablations in object removal. In each sub-figure, the three images (from left to right) display: the masked image, followed by inpainting results from baseline models and baseline models + preference alignment.
Figure 11. More results on reward model bias studies using BrushNet. In each sub-figure, the four images (from left to right) display: the masked image, followed by inpainting results from models trained using HPSv2, PickScore, and Ensemble. For optimal detail, view figures zoomed in.
Figure 12. More results on reward model bias studies using FLUX.1 Fill. In each sub-figure, the four images (from left to right) display: the masked image, followed by inpainting results from models trained using HPSv2, PickScore, and Ensemble. For optimal detail, view figures zoomed in.
Figure 13. More results on Ensemble bias studies using BrushNet. In each sub-figure, the four images (from left to right) display: the masked image, followed by inpainting results from models trained using Ensemble, CaPO, and CaEN. For optimal detail, view figures zoomed in.
Figure 14. More results on Ensemble bias studies using FLUX.1 Fill. In each sub-figure, the four images (from left to right) display: the masked image, followed by inpainting results from models trained using Ensemble, CaPO, and CaEN. For optimal detail, view figures zoomed in.

Original abstract

We study preference alignment for image inpainting. Rather than proposing yet another method, we revisit the problem from first principles and reassess its core challenges. We adopt the widely used direct preference optimization framework and construct preference training data with publicly available reward models. Our empirical study spans nine reward models, two benchmarks, and two baseline inpainting models that differ in architecture and generative mechanism. Our main findings are: (1) Most reward models provide valid signals for preference data construction, although some are unreliable as evaluators. (2) Across models and benchmarks, preference data exhibits consistent trends under both candidate and sample scaling. (3) Reward models display pronounced biases, particularly in brightness, composition, and color scheme, that make them prone to inducing reward hacking. (4) A simple ensemble of reward models mitigates such biases and yields robust, generalizable performance. (5) Preference alignment is transferable to the object removal task, where the goal shifts from open-ended creative generation to coherent background completion. (6) Further analysis reveals that a calibrated ensemble method further mitigates hacking and improves robustness. Without modifying model architectures or introducing additional datasets, our models substantially outperform prior state-of-the-art models on standard metrics, large vision-language model evaluations, and human assessments. Our code is available at: https://github.com/shenytzzz/Follow-Your-Preference.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper conducts an empirical study on preference alignment for image inpainting via Direct Preference Optimization (DPO). It constructs preference pairs using nine public reward models (plus an ensemble and calibrated variant) applied to outputs from two inpainting architectures on two benchmarks. The central claims are that most reward models provide usable signals, scaling trends are consistent, individual models exhibit biases in brightness/composition/color that risk reward hacking, the ensemble mitigates these issues, alignment transfers to object removal, and the resulting models substantially outperform prior SOTA on standard metrics, VLM evaluations, and human assessments without architectural modifications or new datasets. Code is released.

Significance. If the central claims hold, the work offers a practical, architecture-agnostic route to preference alignment for inpainting that leverages only existing reward models and benchmarks. The broad sweep across nine reward models, documentation of their biases, and demonstration of ensemble mitigation provide reusable insights for the community. Reproducibility is strengthened by the public code release. The transfer result to object removal further broadens applicability.

major comments (1)
  1. [Findings (3)–(4) and human evaluation results] The central claim of outperformance on human assessments (abstract and finding (4)) rests on the ensemble serving as a valid proxy for desirable inpainting quality. The manuscript documents pronounced biases in individual reward models and presents the ensemble as mitigation, yet provides no quantitative verification—such as rank correlation (Kendall’s τ or Spearman’s ρ), agreement rates, or pairwise comparison on a held-out set of inpainted images—between the ensemble’s preference rankings and independent human preference rankings. This verification is load-bearing for interpreting the human-assessment gains as generalizable improvements rather than optimization toward shared visual biases.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback. The point raised about validating the ensemble against human preferences is well-taken, and we address it directly below with a commitment to strengthen the manuscript.

point-by-point responses
  1. Referee: [Findings (3)–(4) and human evaluation results] The central claim of outperformance on human assessments (abstract and finding (4)) rests on the ensemble serving as a valid proxy for desirable inpainting quality. The manuscript documents pronounced biases in individual reward models and presents the ensemble as mitigation, yet provides no quantitative verification—such as rank correlation (Kendall’s τ or Spearman’s ρ), agreement rates, or pairwise comparison on a held-out set of inpainted images—between the ensemble’s preference rankings and independent human preference rankings. This verification is load-bearing for interpreting the human-assessment gains as generalizable improvements rather than optimization toward shared visual biases.

    Authors: We agree that explicit quantitative validation of the ensemble's alignment with human judgments would strengthen the claims. The human assessments reported in the paper are direct evaluations of the final model outputs (not mediated by the ensemble), providing independent evidence of improvement on standard metrics, VLM judgments, and human ratings. However, to address the concern about potential shared biases and to confirm the ensemble as a reliable proxy, we will add a new subsection in the revised manuscript. This will report Kendall’s τ and Spearman’s ρ, along with agreement rates, computed between the ensemble preference rankings and independent human preference rankings on a held-out set of inpainted images. We will also include pairwise comparison results. This analysis will be performed on data from the existing benchmarks and human study protocol. We believe this addition directly mitigates the load-bearing concern while preserving the architecture-agnostic nature of the study. revision: yes
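
The agreement statistics named in this exchange can be computed as in the sketch below. It is a minimal pure-Python illustration under stated assumptions: the ensemble scores and human ratings are hypothetical values for five held-out images, Kendall's τ is the tau-a variant (tied pairs contribute zero), and none of the numbers come from the paper.

```python
from itertools import combinations


def ranks(values):
    """Average ranks (1-based); tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=values.__getitem__)
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman_rho(x, y):
    """Spearman's rho as the Pearson correlation of the rank vectors."""
    return pearson(ranks(x), ranks(y))


def kendall_tau_a(x, y):
    """Kendall's tau-a: (concordant - discordant) / number of pairs."""
    s = 0
    for i, j in combinations(range(len(x)), 2):
        prod = (x[i] - x[j]) * (y[i] - y[j])
        s += (prod > 0) - (prod < 0)
    n = len(x)
    return s / (n * (n - 1) / 2)


def pairwise_agreement(x, y):
    """Fraction of untied pairs that both score lists order the same way."""
    agree = total = 0
    for i, j in combinations(range(len(x)), 2):
        prod = (x[i] - x[j]) * (y[i] - y[j])
        if prod != 0:
            total += 1
            agree += prod > 0
    return agree / total


# Hypothetical ensemble scores and mean human ratings for five images.
ensemble = [0.9, 0.4, 0.7, 0.1, 0.5]
human = [4.5, 2.0, 3.0, 1.5, 3.5]
rho = spearman_rho(ensemble, human)
tau = kendall_tau_a(ensemble, human)
agreement = pairwise_agreement(ensemble, human)
```

In this example the two rankings differ on a single pair of images, giving ρ = 0.9, τ = 0.8, and a pairwise agreement rate of 0.9; low values on real data would support the referee's concern that the ensemble and human raters disagree.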

Circularity Check

0 steps flagged

No circularity: empirical study with external reward models and benchmarks

full rationale

The paper is an empirical investigation that constructs preference pairs from nine publicly available reward models, applies DPO to two baseline inpainting models, and reports gains on standard metrics, VLM evaluations, and human assessments. No mathematical derivations, equations, or first-principles results are presented that reduce by construction to fitted parameters or self-defined quantities within the paper. Claims rest on experimental outcomes across models and benchmarks rather than any self-referential prediction or ansatz. Self-citations, if present, are not load-bearing for the central empirical findings, which remain independently falsifiable against external human judgments and public benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim depends on the domain assumption that publicly available reward models supply preference signals that generalize beyond the models' own biases; no free parameters or invented entities are introduced.

axioms (1)
  • domain assumption: Reward models provide valid signals for preference data construction in image inpainting
    The paper constructs training data with nine public reward models and treats their outputs as usable preference labels.

pith-pipeline@v0.9.1-grok · 5794 in / 1202 out tokens · 28832 ms · 2026-06-28T10:49:13.281039+00:00 · methodology


Reference graph

Works this paper leans on

80 extracted references · 31 canonical work pages · 20 internal anchors

  1. [1]

    GPT-4 Technical Report

    Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report.arXiv preprint arXiv:2303.08774, 2023. 5

  2. [2]

    Blended latent diffusion.ACM transactions on graphics, 2023

    Omri Avrahami, Ohad Fried, and Dani Lischinski. Blended latent diffusion.ACM transactions on graphics, 2023. 2, 10

  3. [3]

    A general theoretical paradigm to understand learning from human preferences

    Mohammad Gheshlaghi Azar, Zhaohan Daniel Guo, Bilal Piot, Remi Munos, Mark Rowland, Michal Valko, and Daniele Calandriello. A general theoretical paradigm to understand learning from human preferences. InInternational Confer- ence on Artificial Intelligence and Statistics, 2024. 2

  4. [4]

    Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback

    Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, et al. Training a helpful and harmless assistant with reinforcement learning from human feedback.arXiv preprint arXiv:2204.05862, 2022. 3

  5. [5]

    Image inpainting

    Marcelo Bertalmio, Guillermo Sapiro, Vincent Caselles, and Coloma Ballester. Image inpainting. InProceedings of the 27th annual conference on Computer graphics and interactive techniques, 2000. 1, 2

  6. [6]

    Training Diffusion Models with Reinforcement Learning

    Kevin Black, Michael Janner, Yilun Du, Ilya Kostrikov, and Sergey Levine. Training diffusion models with reinforcement learning.arXiv preprint arXiv:2305.13301, 2023. 1, 3

  7. [7]

    Flux.1-dev.https://huggingface

    BlackForestLabs. Flux.1-dev.https://huggingface. co/black- forest- labs/FLUX.1- dev, 2024. Ac- cessed: 2025-07-12. 2

  8. [8]

    Flux.1-fill-dev.https : / / huggingface.co/black- forest- labs/FLUX.1- Fill-dev, 2024

    BlackForestLabs. Flux.1-fill-dev.https : / / huggingface.co/black- forest- labs/FLUX.1- Fill-dev, 2024. Accessed: 2025-07-12. 1, 3, 10 16

  9. [9]

    Perception Encoder: The best visual embeddings are not at the output of the network

    Daniel Bolya, Po-Yao Huang, Peize Sun, Jang Hyun Cho, An- drea Madotto, Chen Wei, Tengyu Ma, Jiale Zhi, Jathushan Ra- jasegaran, Hanoona Rasheed, et al. Perception encoder: The best visual embeddings are not at the output of the network. arXiv preprint arXiv:2504.13181, 2025. 5

  10. [10]

    In- structpix2pix: Learning to follow image editing instructions

    Tim Brooks, Aleksander Holynski, and Alexei A Efros. In- structpix2pix: Learning to follow image editing instructions. InIEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023. 3

  11. [11]

    Follow-your-canvas: Higher-resolution video outpainting with extensive content generation.arXiv preprint arXiv:2409.01055, 2024

    Qihua Chen, Yue Ma, Hongfa Wang, Junkun Yuan, Wenzhe Zhao, Qi Tian, Hongmei Wang, Shaobo Min, Qifeng Chen, and Wei Liu. Follow-your-canvas: Higher-resolution video outpainting with extensive content generation.arXiv preprint arXiv:2409.01055, 2024. 2

  12. [12]

    Scaling rectified flow trans- formers for high-resolution image synthesis

    Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim En- tezari, Jonas M¨uller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow trans- formers for high-resolution image synthesis. InInternational conference on machine learning, 2024. 2, 3

  13. [13]

    Dpok: Reinforcement learning for fine-tuning text-to-image diffu- sion models.Advances in Neural Information Processing Systems, 2023

    Ying Fan, Olivia Watkins, Yuqing Du, Hao Liu, Moonkyung Ryu, Craig Boutilier, Pieter Abbeel, Mohammad Ghavamzadeh, Kangwook Lee, and Kimin Lee. Dpok: Reinforcement learning for fine-tuning text-to-image diffu- sion models.Advances in Neural Information Processing Systems, 2023. 1, 3

  14. [14]

    I dream my painting: Connecting mllms and diffusion models via prompt generation for text-guided multi-mask inpainting

    Nicola Fanelli, Gennaro Vessio, and Giovanna Castellano. I dream my painting: Connecting mllms and diffusion models via prompt generation for text-guided multi-mask inpainting. InIEEE/CVF Winter Conference on Applications of Computer Vision, 2025. 9

  15. [15]

    Dit4edit: Diffusion transformer for image editing

    Kunyu Feng, Yue Ma, Bingyuan Wang, Chenyang Qi, Haozhe Chen, Qifeng Chen, and Zeyu Wang. Dit4edit: Diffusion transformer for image editing. InProceedings of the AAAI Conference on Artificial Intelligence, pages 2969–2977, 2025. 2

  16. [16]

    Seedream 3.0 Technical Report

    Yu Gao, Lixue Gong, Qiushan Guo, Xiaoxia Hou, Zhichao Lai, Fanshi Li, Liang Li, Xiaochen Lian, Chao Liao, Liyang Liu, et al. Seedream 3.0 technical report.arXiv preprint arXiv:2504.11346, 2025. 3

  17. [17]

    Seedance 1.0: Exploring the Boundaries of Video Generation Models

    Yu Gao, Haoyuan Guo, Tuyen Hoang, Weilin Huang, Lu Jiang, Fangyuan Kong, Huixia Li, Jiashi Li, Liang Li, Xiao- jie Li, et al. Seedance 1.0: Exploring the boundaries of video generation models.arXiv preprint arXiv:2506.09113, 2025. 3

  18. [18]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning ca- pability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025. 3

  19. [19]

    Prompt-to-Prompt Image Editing with Cross Attention Control

    Amir Hertz, Ron Mokady, Jay Tenenbaum, Kfir Aberman, Yael Pritch, and Daniel Cohen-Or. Prompt-to-prompt im- age editing with cross attention control.arXiv preprint arXiv:2208.01626, 2022. 3

  20. [20]

    CLIPScore: A Reference-free Evaluation Metric for Image Captioning

    Jack Hessel, Ari Holtzman, Maxwell Forbes, Ronan Le Bras, and Yejin Choi. Clipscore: A reference-free evaluation metric for image captioning.arXiv preprint arXiv:2104.08718, 2021. 5, 15

  21. [21]

    Denoising dif- fusion probabilistic models.Advances in neural information processing systems, 2020

    Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising dif- fusion probabilistic models.Advances in neural information processing systems, 2020. 1, 2, 3

  22. [22]

    Reinforcement learning in generative multimodal ai: A survey.Authorea Preprints, 2026

    Zijing Hu, Junkun Yuan, Kairong Han, Yunze Tong, Shengyu Zhang, Fei Wu, and Kun Kuang. Reinforcement learning in generative multimodal ai: A survey.Authorea Preprints, 2026. 3

  23. [23]

    Brushnet: A plug-and-play image inpainting model with decomposed dual-branch diffusion

    Xuan Ju, Xian Liu, Xintao Wang, Yuxuan Bian, Ying Shan, and Qiang Xu. Brushnet: A plug-and-play image inpainting model with decomposed dual-branch diffusion. InEuropean Conference on Computer Vision, 2024. 1, 2, 3, 5, 9, 10

  24. [24]

    Rad: Region-aware diffusion models for image inpainting

    Sora Kim, Sungho Suh, and Minsik Lee. Rad: Region-aware diffusion models for image inpainting. InIEEE/CVF Confer- ence on Computer Vision and Pattern Recognition, 2025. 3

  25. [25]

    Pick-a-pic: An open dataset of user preferences for text-to-image generation.Advances in neural information processing systems, 2023

    Yuval Kirstain, Adam Polyak, Uriel Singer, Shahbuland Ma- tiana, Joe Penna, and Omer Levy. Pick-a-pic: An open dataset of user preferences for text-to-image generation.Advances in neural information processing systems, 2023. 1, 5, 10

  26. [26]

    HunyuanVideo: A Systematic Framework For Large Video Generative Models

    Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, et al. Hunyuanvideo: A systematic framework for large video generative models.arXiv preprint arXiv:2412.03603, 2024. 3

  27. [27]

    Calibrated multi-preference optimization for aligning diffusion models

    Kyungmin Lee, Xiahong Li, Qifei Wang, Junfeng He, Junjie Ke, Ming-Hsuan Yang, Irfan Essa, Jinwoo Shin, Feng Yang, and Yinxiao Li. Calibrated multi-preference optimization for aligning diffusion models. InIEEE/CVF Conference on Com- puter Vision and Pattern Recognition, 2025. 3, 4, 14

  28. [28]

    Blip: Bootstrapping language-image pre-training for unified vision- language understanding and generation

    Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. Blip: Bootstrapping language-image pre-training for unified vision- language understanding and generation. InInternational con- ference on machine learning, 2022. 5

  29. [29]

    Rorem: Training a robust object remover with human-in-the-loop

    Ruibin Li, Tao Yang, Song Guo, and Lei Zhang. Rorem: Training a robust object remover with human-in-the-loop. InIEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025. 3, 15

  30. [30]

    Swinir: Image restoration us- ing swin transformer

    Jingyun Liang, Jiezhang Cao, Guolei Sun, Kai Zhang, Luc Van Gool, and Radu Timofte. Swinir: Image restoration us- ing swin transformer. InIEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021. 1

  31. [31]

    Eval- uating text-to-visual generation with image-to-text generation

    Zhiqiu Lin, Deepak Pathak, Baiqi Li, Jiayao Li, Xide Xia, Graham Neubig, Pengchuan Zhang, and Deva Ramanan. Eval- uating text-to-visual generation with image-to-text generation. InEuropean Conference on Computer Vision, 2024. 5

  32. [32]

    Flow Matching for Generative Modeling

    Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling. arXiv preprint arXiv:2210.02747, 2022. 1, 2, 3

  33. [33]

    Structure matters: Tackling the semantic discrepancy in diffusion models for image inpainting

    Haipeng Liu, Yang Wang, Biao Qian, Meng Wang, and Yong Rui. Structure matters: Tackling the semantic discrepancy in diffusion models for image inpainting. InIEEE/CVF Confer- ence on Computer Vision and Pattern Recognition, 2024. 10

  34. [34]

    Improving Video Generation with Human Feedback

    Jie Liu, Gongye Liu, Jiajun Liang, Ziyang Yuan, Xiaokun Liu, Mingwu Zheng, Xiele Wu, Qiulin Wang, Wenyu Qin, Meng- han Xia, et al. Improving video generation with human feed- back.arXiv preprint arXiv:2501.13918, 2025. 1, 3

  35. [35]

    Prefpaint: Aligning image inpainting diffusion model with human preference.Advances in Neural Information Processing Systems, 2024

    Kendong Liu, Zhiyu Zhu, Chuanhao Li, Hui Liu, Huanqiang Zeng, and Junhui Hou. Prefpaint: Aligning image inpainting diffusion model with human preference.Advances in Neural Information Processing Systems, 2024. 3, 10 17

  36. [36]

    Repaint: Inpainting us- ing denoising diffusion probabilistic models

    Andreas Lugmayr, Martin Danelljan, Andres Romero, Fisher Yu, Radu Timofte, and Luc Van Gool. Repaint: Inpainting us- ing denoising diffusion probabilistic models. InIEEE/CVF Conference on Computer Vision and Pattern Recognition,

  37. [37]

    Sit: Exploring flow and diffusion-based generative models with scalable in- terpolant transformers

    Nanye Ma, Mark Goldstein, Michael S Albergo, Nicholas M Boffi, Eric Vanden-Eijnden, and Saining Xie. Sit: Exploring flow and diffusion-based generative models with scalable in- terpolant transformers. InEuropean Conference on Computer Vision, 2024. 3

  38. [38]

    Scaling inference time compute for diffusion models

    Nanye Ma, Shangyuan Tong, Haolin Jia, Hexiang Hu, Yu- Chuan Su, Mingda Zhang, Xuan Yang, Yandong Li, Tommi Jaakkola, Xuhui Jia, et al. Scaling inference time compute for diffusion models. InIEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025. 5

  39. [39]

    Follow your pose: Pose- guided text-to-video generation using pose-free videos

    Yue Ma, Yingqing He, Xiaodong Cun, Xintao Wang, Siran Chen, Xiu Li, and Qifeng Chen. Follow your pose: Pose- guided text-to-video generation using pose-free videos. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 4117–4125, 2024. 2

  40. [40]

    Follow-your-emoji: Fine-controllable and expressive freestyle portrait animation

    Yue Ma, Hongyu Liu, Hongfa Wang, Heng Pan, Yingqing He, Junkun Yuan, Ailing Zeng, Chengfei Cai, Heung-Yeung Shum, Wei Liu, et al. Follow-your-emoji: Fine-controllable and expressive freestyle portrait animation. InSIGGRAPH Asia 2024 Conference Papers, 2024

  41. [41]

    Controllable video generation: A survey.arXiv preprint arXiv:2507.16869, 2025

    Yue Ma, Kunyu Feng, Zhongyuan Hu, Xinyu Wang, Yucheng Wang, Mingzhe Zheng, Xuanhua He, Chenyang Zhu, Hongyu Liu, Yingqing He, et al. Controllable video generation: A survey.arXiv preprint arXiv:2507.16869, 2025

  42. [42]

    Follow-your-creation: Empow- ering 4d creation through video inpainting.arXiv preprint arXiv:2506.04590, 2025

    Yue Ma, Kunyu Feng, Xinhua Zhang, Hongyu Liu, David Jun- hao Zhang, Jinbo Xing, Yinhan Zhang, Ayden Yang, Zeyu Wang, and Qifeng Chen. Follow-your-creation: Empow- ering 4d creation through video inpainting.arXiv preprint arXiv:2506.04590, 2025

  43. [43]

    Follow-your-motion: Video motion transfer via efficient spatial-temporal decoupled finetuning.arXiv preprint arXiv:2506.05207, 2025

    Yue Ma, Yulong Liu, Qiyuan Zhu, Ayden Yang, Kunyu Feng, Xinhua Zhang, Zhifeng Li, Sirui Han, Chenyang Qi, and Qifeng Chen. Follow-your-motion: Video motion transfer via efficient spatial-temporal decoupled finetuning.arXiv preprint arXiv:2506.05207, 2025. 2

  44. [44]

    Hpsv3: Towards wide-spectrum human preference score

    Yuhang Ma, Xiaoshi Wu, Keqiang Sun, and Hongsheng Li. Hpsv3: Towards wide-spectrum human preference score. arXiv preprint arXiv:2508.03789, 2025. 5

  45. [45]

    Group editing: Edit multiple images in one go.arXiv preprint arXiv:2603.22883, 2026

    Yue Ma, Xinyu Wang, Qianli Ma, Qinghe Wang, Mingzhe Zheng, Xiangpeng Yang, Hao Li, Chongbo Zhao, Jixuan Ying, Harry Yang, et al. Group editing: Edit multiple images in one go.arXiv preprint arXiv:2603.22883, 2026. 2

  46. [46]

    Fastvmt: Eliminating redundancy in video motion transfer.arXiv preprint arXiv:2602.05551, 2026

    Yue Ma, Zhikai Wang, Tianhao Ren, Mingzhe Zheng, Hongyu Liu, Jiayi Guo, Mark Fong, Yuxuan Xue, Zixiang Zhao, Kon- rad Schindler, et al. Fastvmt: Eliminating redundancy in video motion transfer.arXiv preprint arXiv:2602.05551, 2026. 2

  47. [47]

    Hd-painter: high-resolution and prompt-faithful text-guided image inpainting with diffusion models

    Hayk Manukyan, Andranik Sargsyan, Barsegh Atanyan, Zhangyang Wang, Shant Navasardyan, and Humphrey Shi. Hd-painter: high-resolution and prompt-faithful text-guided image inpainting with diffusion models. InInternational Con- ference on Learning Representations, 2023. 1, 2, 10

  48. [48]

    The Effects of Reward Misspecification: Mapping and Mitigating Misaligned Models

    Alexander Pan, Kush Bhatia, and Jacob Steinhardt. The effects of reward misspecification: Mapping and mitigating misaligned models. arXiv preprint arXiv:2201.03544, 2022. 1, 7

  49. [49]

    Qwen3-vl

    QwenTeam. Qwen3-vl. https://qwen.ai/blog?id=qwen3-vl, 2025. Accessed: 2025-11-20. 11

  50. [50]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, 2021. 5

  51. [51]

    Direct preference optimization: Your language model is secretly a reward model

    Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. Advances in neural information processing systems, 2023. 1, 3

  52. [52]

    High-resolution image synthesis with latent diffusion models

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022. 2, 10

  53. [53]

    U-net: Convolutional networks for biomedical image segmentation

    Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention, 2015. 1, 3

  54. [54]

    Laion-5b: An open large-scale dataset for training next generation image-text models

    Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, et al. Laion-5b: An open large-scale dataset for training next generation image-text models. Advances in neural information processing systems, 2022. 5

  55. [55]

    Proximal Policy Optimization Algorithms

    John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017. 3, 4

  56. [56]

    Follow-your-preference: Towards preference-aligned image inpainting

    Yutao Shen, Junkun Yuan, Toru Aonishi, Hideki Nakayama, and Yue Ma. Follow-your-preference: Towards preference-aligned image inpainting. arXiv preprint arXiv:2509.23082,

  57. [57]

    Deep unsupervised learning using nonequilibrium thermodynamics

    Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In International conference on machine learning, 2015. 3

  58. [58]

    Denoising Diffusion Implicit Models

    Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502,

  59. [59]

    Reinforcement learning: An introduction

    Richard S Sutton, Andrew G Barto, et al. Reinforcement learning: An introduction. MIT press Cambridge, 1998. 3

  60. [60]

    Attention is all you need

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 2017. 1, 3

  61. [61]

    Diffusion model alignment using direct preference optimization

    Bram Wallace, Meihua Dang, Rafael Rafailov, Linqi Zhou, Aaron Lou, Senthil Purushwalkam, Stefano Ermon, Caiming Xiong, Shafiq Joty, and Nikhil Naik. Diffusion model alignment using direct preference optimization. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8228–8238, 2024. 1, 3, 12

  62. [62]

    Wan: Open and Advanced Large-Scale Video Generative Models

    Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, et al. Wan: Open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314, 2025. 3

  63. [63]

    Taming rectified flow for inversion and editing

    Jiangshan Wang, Junfu Pu, Zhongang Qi, Jiayi Guo, Yue Ma, Nisha Huang, Yuxin Chen, Xiu Li, and Ying Shan. Taming rectified flow for inversion and editing. arXiv preprint arXiv:2411.04746, 2024. 2

  64. [64]

    Simplear: Pushing the frontier of autoregressive visual generation through pretraining, sft, and rl

    Junke Wang, Zhi Tian, Xun Wang, Xinyu Zhang, Weilin Huang, Zuxuan Wu, and Yu-Gang Jiang. Simplear: Pushing the frontier of autoregressive visual generation through pretraining, sft, and rl. arXiv preprint arXiv:2504.11455, 2025. 4

  65. [65]

    Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

    Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution. arXiv preprint arXiv:2409.12191, 2024. 5

  66. [66]

    Imagen editor and editbench: Advancing and evaluating text-guided image inpainting

    Su Wang, Chitwan Saharia, Ceslee Montgomery, Jordi Pont-Tuset, Shai Noy, Stefano Pellegrini, Yasumasa Onoe, Sarah Laszlo, David J Fleet, Radu Soricut, et al. Imagen editor and editbench: Advancing and evaluating text-guided image inpainting. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023. 1, 5

  67. [67]

    Towards enhanced image inpainting: Mitigating unwanted object insertion and preserving color consistency

    Yikai Wang, Chenjie Cao, Junqiu Yu, Ke Fan, Xiangyang Xue, and Yanwei Fu. Towards enhanced image inpainting: Mitigating unwanted object insertion and preserving color consistency. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025. 2, 10

  68. [68]

    Unified Reward Model for Multimodal Understanding and Generation

    Yibin Wang, Yuhang Zang, Hao Li, Cheng Jin, and Jiaqi Wang. Unified reward model for multimodal understanding and generation. arXiv preprint arXiv:2503.05236, 2025. 4, 5

  69. [69]

    Preference alignment on diffusion model: A comprehensive survey for image generation and editing

    Sihao Wu, Xiaonan Si, Chi Xing, Jianhong Wang, Gaojie Jin, Guangliang Cheng, Lijun Zhang, and Xiaowei Huang. Preference alignment on diffusion model: A comprehensive survey for image generation and editing. arXiv preprint arXiv:2502.07829, 2025. 1

  70. [70]

    Human Preference Score v2: A Solid Benchmark for Evaluating Human Preferences of Text-to-Image Synthesis

    Xiaoshi Wu, Yiming Hao, Keqiang Sun, Yixiong Chen, Feng Zhu, Rui Zhao, and Hongsheng Li. Human preference score v2: A solid benchmark for evaluating human preferences of text-to-image synthesis. arXiv preprint arXiv:2306.09341,

  71. [71]

    Imagereward: Learning and evaluating human preferences for text-to-image generation

    Jiazheng Xu, Xiao Liu, Yuchen Wu, Yuxuan Tong, Qinkai Li, Ming Ding, Jie Tang, and Yuxiao Dong. Imagereward: Learning and evaluating human preferences for text-to-image generation. Advances in Neural Information Processing Systems,

  72. [72]

    DanceGRPO: Unleashing GRPO on Visual Generation

    Zeyue Xue, Jie Wu, Yu Gao, Fangyuan Kong, Lingting Zhu, Mengzhao Chen, Zhiheng Liu, Wei Liu, Qiushan Guo, Weilin Huang, et al. Dancegrpo: Unleashing grpo on visual generation. arXiv preprint arXiv:2505.07818, 2025. 1, 4

  73. [73]

    Label-efficient domain generalization via collaborative exploration and generalization

    Junkun Yuan, Xu Ma, Defang Chen, Kun Kuang, Fei Wu, and Lanfen Lin. Label-efficient domain generalization via collaborative exploration and generalization. In Proceedings of the 30th ACM international conference on multimedia, pages 2361–2370, 2022. 3

  74. [74]

    Domain-specific bias filtering for single labeled domain generalization

    Junkun Yuan, Xu Ma, Defang Chen, Kun Kuang, Fei Wu, and Lanfen Lin. Domain-specific bias filtering for single labeled domain generalization. International Journal of Computer Vision, 131(2):552–571, 2023. 2

  75. [75]

    Collaborative semantic aggregation and calibration for federated domain generalization

    Junkun Yuan, Xu Ma, Defang Chen, Fei Wu, Lanfen Lin, and Kun Kuang. Collaborative semantic aggregation and calibration for federated domain generalization. IEEE Transactions on Knowledge and Data Engineering, 35(12):12528–12541,

  76. [76]

    Hap: Structure-aware masked image modeling for human-centric perception

    Junkun Yuan, Xinyu Zhang, Hao Zhou, Jian Wang, Zhongwei Qiu, Zhiyin Shao, Shaofeng Zhang, Sifan Long, Kun Kuang, Kun Yao, et al. Hap: Structure-aware masked image modeling for human-centric perception. Advances in Neural Information Processing Systems, 2023. 2

  77. [77]

    Magicbrush: A manually annotated dataset for instruction-guided image editing

    Kai Zhang, Lingbo Mo, Wenhu Chen, Huan Sun, and Yu Su. Magicbrush: A manually annotated dataset for instruction-guided image editing. Advances in Neural Information Processing Systems, 2023. 1

  78. [78]

    Adding conditional control to text-to-image diffusion models

    Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023. 2, 10

  79. [79]

    The unreasonable effectiveness of deep features as a perceptual metric

    Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018. 15

  80. [80]

    A task is worth one word: Learning with task prompts for high-quality versatile image inpainting

    Junhao Zhuang, Yanhong Zeng, Wenran Liu, Chun Yuan, and Kai Chen. A task is worth one word: Learning with task prompts for high-quality versatile image inpainting. In European Conference on Computer Vision, 2024. 1, 2, 10

Appendix A. More Results on Reward Model Bias Studies

(a) A cartoon drawing of a kitchen. (b) A pink lotus flower blooming with gre...