Pith · machine review for the scientific record

arxiv: 2603.02210 · v3 · submitted 2026-03-02 · 💻 cs.CV

Recognition: 2 theorem links · Lean Theorem

HiFi-Inpaint: Towards High-Fidelity Reference-Based Inpainting for Generating Detail-Preserving Human-Product Images

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 17:22 UTC · model grok-4.3

classification 💻 cs.CV
keywords image inpainting · reference-based inpainting · human-product images · detail preservation · attention mechanism · loss function · dataset curation · generative models

The pith

HiFi-Inpaint uses shared enhancement attention and detail-aware loss to generate high-fidelity human-product images that preserve product details.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper targets the problem of creating realistic human-product images for advertising and e-commerce where product details must remain accurate after inpainting. Existing reference-based methods lack sufficient specialized data, fail to emphasize fine product features, and rely on supervision too coarse to guide precise edits. HiFi-Inpaint introduces Shared Enhancement Attention to sharpen product features drawn from reference images and Detail-Aware Loss to enforce pixel-level accuracy through high-frequency maps, trained on the newly built HP-Image-40K dataset. A sympathetic reader cares because better detail preservation directly improves visual realism in commercial imagery. The central claim is that these additions together deliver state-of-the-art results on this task.

Core claim

HiFi-Inpaint is a reference-based inpainting framework that introduces Shared Enhancement Attention (SEA) to refine fine-grained product features and Detail-Aware Loss (DAL) to enforce precise pixel-level supervision using high-frequency maps, trained on the automatically curated HP-Image-40K dataset, and achieves state-of-the-art performance in generating detail-preserving human-product images.
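
The text available here names SEA without specifying its internals, so the following is a hypothetical minimal reading rather than the authors' module: a cross-attention block in which tokens from the region being inpainted query tokens from the product reference image, with a learned gate that amplifies the attended reference features before they are fused back. The class name, shapes, and gating choice below are all illustrative.

```python
import torch
import torch.nn as nn


class SharedEnhancementAttention(nn.Module):
    """Hypothetical SEA block (illustrative, not the paper's definition):
    scene tokens from the region being inpainted attend to product-reference
    tokens, and a learned gate re-weights the attended features before fusion."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.gate = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())
        self.norm = nn.LayerNorm(dim)

    def forward(self, scene_tokens: torch.Tensor,
                ref_tokens: torch.Tensor) -> torch.Tensor:
        # scene_tokens: (B, N, dim) features of the masked human-scene region
        # ref_tokens:   (B, M, dim) features of the product reference image
        attended, _ = self.cross_attn(query=scene_tokens,
                                      key=ref_tokens,
                                      value=ref_tokens)
        # The gate emphasises reference channels judged relevant to the
        # queried region, one plausible reading of "enhancement".
        enhanced = self.gate(attended) * attended
        return self.norm(scene_tokens + enhanced)
```

Where such a block would sit in the generator (U-Net cross-attention layers, DiT blocks, how many insertions) is exactly what the referee's first minor comment asks the authors to pin down with an equation or pseudocode.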

What carries the argument

Shared Enhancement Attention (SEA) combined with Detail-Aware Loss (DAL), which together refine product features from references and supply high-frequency pixel supervision during inpainting.
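
Likewise, the Detail-Aware Loss is described only as pixel-level supervision on high-frequency maps. A minimal sketch under the assumption that the maps come from fixed Sobel filters and that the penalty is an L1 term restricted to the product mask; the operator choice, masking, and `lambda_hf` weight are placeholders, not values from the paper.

```python
import torch
import torch.nn.functional as F

# Fixed 3x3 Sobel kernels; the transpose of the x-kernel is the y-kernel.
_SOBEL_X = torch.tensor([[-1., 0., 1.],
                         [-2., 0., 2.],
                         [-1., 0., 1.]])
_SOBEL_Y = _SOBEL_X.t()


def high_frequency_map(img: torch.Tensor) -> torch.Tensor:
    """Per-channel gradient magnitude of a (B, C, H, W) image."""
    c = img.shape[1]
    kx = _SOBEL_X.to(img).view(1, 1, 3, 3).repeat(c, 1, 1, 1)
    ky = _SOBEL_Y.to(img).view(1, 1, 3, 3).repeat(c, 1, 1, 1)
    gx = F.conv2d(img, kx, padding=1, groups=c)
    gy = F.conv2d(img, ky, padding=1, groups=c)
    return torch.sqrt(gx ** 2 + gy ** 2 + 1e-8)


def detail_aware_loss(pred: torch.Tensor,
                      target: torch.Tensor,
                      mask: torch.Tensor,
                      lambda_hf: float = 1.0) -> torch.Tensor:
    """Base L1 reconstruction plus an L1 penalty on high-frequency maps.

    `mask` is a (B, 1, H, W) map that is 1 inside the inpainted product
    region, so the high-frequency term concentrates supervision on product
    details; `lambda_hf` is an illustrative weight, not the paper's value.
    """
    recon = F.l1_loss(pred, target)
    hf = F.l1_loss(high_frequency_map(pred) * mask,
                   high_frequency_map(target) * mask)
    return recon + lambda_hf * hf
```

Whether the real DAL uses Sobel, Laplacian, or wavelet maps, and whether it is masked or global, is left open by the abstract and flagged in the referee report.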

If this is right

  • The generated images retain higher visual fidelity for use in advertising and e-commerce.
  • The framework outperforms prior reference-based inpainting models on metrics of detail preservation.
  • The HP-Image-40K dataset supports more effective training for product-specific image synthesis.
  • High-frequency map supervision leads to better retention of small-scale product features.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The automatic curation pipeline could be adapted to build specialized datasets for other narrow-domain generation tasks.
  • The approach may combine with diffusion models to further increase realism while keeping reference guidance.
  • Similar attention and loss designs could apply to detail-critical editing in medical or technical imagery.

Load-bearing premise

The assumption that the new attention module and loss, trained on the HP-Image-40K dataset, will produce generalizable improvements in product detail fidelity beyond the specific training distribution.

What would settle it

A set of test images from a different product domain or real-world capture where the model visibly distorts or loses fine textures and logos would show the improvements do not generalize.

read the original abstract

Human-product images, which showcase the integration of humans and products, play a vital role in advertising, e-commerce, and digital marketing. The essential challenge of generating such images lies in ensuring the high-fidelity preservation of product details. Among existing paradigms, reference-based inpainting offers a targeted solution by leveraging product reference images to guide the inpainting process. However, limitations remain in three key aspects: the lack of diverse large-scale training data, the struggle of current models to focus on product detail preservation, and the inability of coarse supervision for achieving precise guidance. To address these issues, we propose HiFi-Inpaint, a novel high-fidelity reference-based inpainting framework tailored for generating human-product images. HiFi-Inpaint introduces Shared Enhancement Attention (SEA) to refine fine-grained product features and Detail-Aware Loss (DAL) to enforce precise pixel-level supervision using high-frequency maps. Additionally, we construct a new dataset, HP-Image-40K, with samples curated from self-synthesis data and processed with automatic filtering. Experimental results show that HiFi-Inpaint achieves state-of-the-art performance, delivering detail-preserving human-product images.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes HiFi-Inpaint, a reference-based inpainting framework for high-fidelity human-product image generation. It introduces Shared Enhancement Attention (SEA) to refine fine-grained product features, Detail-Aware Loss (DAL) for pixel-level supervision via high-frequency maps, and the automatically curated HP-Image-40K dataset. The central claim is that these components enable state-of-the-art detail-preserving results in human-product composites for applications such as e-commerce.

Significance. If the experimental claims hold with rigorous validation, the work could advance practical inpainting for advertising and retail imagery by improving product detail fidelity. The dataset contribution and the two new modules (SEA and DAL) represent targeted engineering advances, though their generalizability beyond the training distribution remains to be demonstrated.

major comments (3)
  1. [Experimental Results] Experimental Results section: The abstract asserts SOTA performance, yet the provided text supplies no quantitative metrics (e.g., PSNR, SSIM, LPIPS, FID), baseline comparisons, ablation studies, or error analysis. Without these, the data-to-claim link for the superiority of SEA + DAL + HP-Image-40K cannot be evaluated.
  2. [Method] Method section (description of DAL): The Detail-Aware Loss is defined using high-frequency maps, but no derivation or ablation shows that this formulation is necessary versus standard perceptual or reconstruction losses; the load-bearing claim that it enforces 'precise pixel-level supervision' requires explicit quantitative support.
  3. [Dataset] Dataset section: The HP-Image-40K curation process is described as automatic filtering of self-synthesis data, but no statistics on filtering criteria, diversity metrics, or train/test split details are given, undermining claims of improved generalizability.
minor comments (2)
  1. [Method] Notation for SEA is introduced without a clear equation or diagram reference; add a formal definition or pseudocode in §3.
  2. [Figures] Figure captions for qualitative results should include the specific baselines shown and the metric values for each example.

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We agree that the current manuscript version requires strengthening in the experimental validation, justification of the Detail-Aware Loss, and dataset documentation. We will revise the paper to address each point with additional quantitative evidence, ablations, and statistics while preserving the core contributions of SEA, DAL, and the HP-Image-40K dataset.

read point-by-point responses
  1. Referee: [Experimental Results] Experimental Results section: The abstract asserts SOTA performance, yet the provided text supplies no quantitative metrics (e.g., PSNR, SSIM, LPIPS, FID), baseline comparisons, ablation studies, or error analysis. Without these, the data-to-claim link for the superiority of SEA + DAL + HP-Image-40K cannot be evaluated.

    Authors: We acknowledge the omission in the submitted version. The revised manuscript will expand the Experimental Results section to include full quantitative tables reporting PSNR, SSIM, LPIPS, and FID scores against multiple reference-based inpainting baselines. We will also present ablation studies isolating the contributions of SEA and DAL, along with error analysis on failure cases. These additions will explicitly link the metrics to the superiority claims for the combined SEA + DAL + HP-Image-40K pipeline (an illustrative region-level metric sketch follows these responses). revision: yes

  2. Referee: [Method] Method section (description of DAL): The Detail-Aware Loss is defined using high-frequency maps, but no derivation or ablation shows that this formulation is necessary versus standard perceptual or reconstruction losses; the load-bearing claim that it enforces 'precise pixel-level supervision' requires explicit quantitative support.

    Authors: We will add a short derivation subsection explaining the high-frequency map construction and its motivation for pixel-level detail preservation. The revised version will include a dedicated ablation table comparing DAL against standard L1, perceptual (VGG), and adversarial losses, reporting quantitative gains in high-frequency fidelity metrics. This will provide the requested empirical support for the claim of precise supervision. revision: yes

  3. Referee: [Dataset] Dataset section: The HP-Image-40K curation process is described as automatic filtering of self-synthesis data, but no statistics on filtering criteria, diversity metrics, or train/test split details are given, undermining claims of improved generalizability.

    Authors: We will expand the Dataset section with explicit statistics: filtering thresholds (e.g., minimum resolution, product visibility scores), diversity metrics (pose variance, product category distribution, background complexity), and the exact train/validation/test split ratios. These details will be accompanied by a brief analysis of how the curation improves coverage relative to prior datasets, directly supporting the generalizability claims (an illustrative filtering skeleton follows these responses). revision: yes
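
For responses 1 and 2 above, the promised numbers could take roughly the following shape: full-reference metrics restricted to the inpainted product region, plus a simple high-frequency fidelity score standing in for whatever detail metric the authors adopt. A hedged sketch using scikit-image and SciPy; LPIPS and FID require learned models (e.g. the lpips and torchmetrics packages) and are omitted here.

```python
import numpy as np
from scipy import ndimage
from skimage.metrics import peak_signal_noise_ratio, structural_similarity


def sobel_magnitude(img: np.ndarray) -> np.ndarray:
    """Gradient magnitude of a float image in [0, 1], averaged over channels."""
    gray = img.mean(axis=-1) if img.ndim == 3 else img
    return np.hypot(ndimage.sobel(gray, axis=0), ndimage.sobel(gray, axis=1))


def region_metrics(pred: np.ndarray, target: np.ndarray,
                   mask: np.ndarray) -> dict:
    """pred/target: (H, W, 3) floats in [0, 1]; mask: (H, W) bool product region."""
    edges_p, edges_t = sobel_magnitude(pred), sobel_magnitude(target)
    return {
        # PSNR over the masked product pixels only.
        "psnr": peak_signal_noise_ratio(target[mask], pred[mask], data_range=1.0),
        # SSIM needs spatial structure, so it is computed on the full frame;
        # cropping to the product bounding box would be a stricter variant.
        "ssim": structural_similarity(target, pred, channel_axis=-1,
                                      data_range=1.0),
        # Illustrative high-frequency fidelity: PSNR between Sobel edge maps,
        # restricted to the product region.
        "hf_psnr": peak_signal_noise_ratio(edges_t[mask], edges_p[mask],
                                           data_range=max(edges_t.max(), 1e-8)),
    }
```

The Sobel-edge PSNR here is illustrative only; the revision would still need to state the actual high-frequency metric and its parameters.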
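
For response 3, the curation pipeline is described only as automatic filtering of self-synthesised data. The skeleton below illustrates the kind of thresholds the revised Dataset section would have to report; both scoring hooks and every numeric value are hypothetical placeholders, not figures from the paper.

```python
from dataclasses import dataclass
from typing import Callable, Optional

from PIL import Image


@dataclass
class FilterConfig:
    # Illustrative thresholds only; the paper does not report these values.
    min_side: int = 768          # minimum resolution (shorter side, in pixels)
    min_visibility: float = 0.5  # product detector confidence / visible fraction
    min_aesthetic: float = 5.0   # learned aesthetic-predictor score


def keep_sample(path: str,
                product_visibility: Callable[[str], float],
                aesthetic_score: Callable[[str], float],
                cfg: Optional[FilterConfig] = None) -> bool:
    """Return True if a self-synthesised sample passes all automatic filters.

    `product_visibility` and `aesthetic_score` are hypothetical hooks (e.g. a
    detector's confidence on the product crop and an aesthetic predictor);
    the real models, thresholds, and split ratios are what the revised
    Dataset section would need to state.
    """
    cfg = cfg or FilterConfig()
    with Image.open(path) as im:
        if min(im.size) < cfg.min_side:
            return False
    if product_visibility(path) < cfg.min_visibility:
        return False
    return aesthetic_score(path) >= cfg.min_aesthetic
```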

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper proposes new architectural modules (SEA, DAL) and a curated dataset (HP-Image-40K) whose performance claims rest on experimental comparisons rather than any closed-form derivation, fitted parameter renamed as prediction, or self-citation chain. No equations are presented that reduce the claimed fidelity gains to quantities defined by the inputs themselves; the central improvements are introduced as novel and evaluated externally.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The central claim rests on the effectiveness of two newly introduced modules and an automatically filtered dataset whose construction details are not provided in the abstract.

axioms (1)
  • domain assumption Standard deep-learning training assumptions for conditional image generation hold for the reference-based inpainting task.
    The framework is presented as an extension of existing reference-based inpainting paradigms without additional justification of training stability or convergence.
invented entities (2)
  • Shared Enhancement Attention (SEA) no independent evidence
    purpose: Refine fine-grained product features during inpainting
    New attention mechanism introduced to focus on product details.
  • Detail-Aware Loss (DAL) no independent evidence
    purpose: Enforce precise pixel-level supervision using high-frequency maps
    New loss term proposed for detail preservation.

pith-pipeline@v0.9.0 · 5550 in / 1283 out tokens · 53662 ms · 2026-05-15T17:22:37.971541+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. OmniShow: Unifying Multimodal Conditions for Human-Object Interaction Video Generation

    cs.CV · 2026-04 · unverdicted · novelty 6.0

    OmniShow unifies text, image, audio, and pose conditions into an end-to-end model for high-quality human-object interaction video generation and introduces the HOIVG-Bench benchmark, claiming state-of-the-art results.

Reference graph

Works this paper leans on

61 extracted references · 61 canonical work pages · cited by 1 Pith paper · 3 internal anchors

