Pith · machine review for the scientific record

arxiv: 2603.02210 · v3 · submitted 2026-03-02 · 💻 cs.CV

Recognition: 2 theorem links · Lean Theorem

HiFi-Inpaint: Towards High-Fidelity Reference-Based Inpainting for Generating Detail-Preserving Human-Product Images

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 17:22 UTC · model grok-4.3

classification 💻 cs.CV
keywords image inpainting · reference-based inpainting · human-product images · detail preservation · attention mechanism · loss function · dataset curation · generative models

The pith

HiFi-Inpaint uses shared enhancement attention and detail-aware loss to generate high-fidelity human-product images that preserve product details.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper targets the problem of creating realistic human-product images for advertising and e-commerce where product details must remain accurate after inpainting. Existing reference-based methods lack sufficient specialized data, fail to emphasize fine product features, and rely on supervision too coarse to guide precise edits. HiFi-Inpaint introduces Shared Enhancement Attention to sharpen product features drawn from reference images and Detail-Aware Loss to enforce pixel-level accuracy through high-frequency maps, trained on the newly built HP-Image-40K dataset. A sympathetic reader cares because better detail preservation directly improves visual realism in commercial imagery. The central claim is that these additions together deliver state-of-the-art results on this task.

Core claim

HiFi-Inpaint is a reference-based inpainting framework that introduces Shared Enhancement Attention (SEA) to refine fine-grained product features and Detail-Aware Loss (DAL) to enforce precise pixel-level supervision using high-frequency maps, trained on the automatically curated HP-Image-40K dataset, and achieves state-of-the-art performance in generating detail-preserving human-product images.
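
The text available here names SEA without specifying its internals, so the following is a hypothetical minimal reading rather than the authors' module: a cross-attention block in which tokens from the region being inpainted query tokens from the product reference image, with a learned gate that amplifies the attended reference features before they are fused back. The class name, shapes, and gating choice below are all illustrative.

```python
import torch
import torch.nn as nn


class SharedEnhancementAttention(nn.Module):
    """Hypothetical SEA block (illustrative, not the paper's definition):
    scene tokens from the region being inpainted attend to product-reference
    tokens, and a learned gate re-weights the attended features before fusion."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.gate = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())
        self.norm = nn.LayerNorm(dim)

    def forward(self, scene_tokens: torch.Tensor,
                ref_tokens: torch.Tensor) -> torch.Tensor:
        # scene_tokens: (B, N, dim) features of the masked human-scene region
        # ref_tokens:   (B, M, dim) features of the product reference image
        attended, _ = self.cross_attn(query=scene_tokens,
                                      key=ref_tokens,
                                      value=ref_tokens)
        # The gate emphasises reference channels judged relevant to the
        # queried region, one plausible reading of "enhancement".
        enhanced = self.gate(attended) * attended
        return self.norm(scene_tokens + enhanced)
```

Where such a block would sit in the generator (U-Net cross-attention layers, DiT blocks, how many insertions) is exactly what the referee's first minor comment asks the authors to pin down with an equation or pseudocode.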

What carries the argument

Shared Enhancement Attention (SEA) combined with Detail-Aware Loss (DAL), which together refine product features from references and supply high-frequency pixel supervision during inpainting.
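
Likewise, the Detail-Aware Loss is described only as pixel-level supervision on high-frequency maps. A minimal sketch under the assumption that the maps come from fixed Sobel filters and that the penalty is an L1 term restricted to the product mask; the operator choice, masking, and `lambda_hf` weight are placeholders, not values from the paper.

```python
import torch
import torch.nn.functional as F

# Fixed 3x3 Sobel kernels; the transpose of the x-kernel is the y-kernel.
_SOBEL_X = torch.tensor([[-1., 0., 1.],
                         [-2., 0., 2.],
                         [-1., 0., 1.]])
_SOBEL_Y = _SOBEL_X.t()


def high_frequency_map(img: torch.Tensor) -> torch.Tensor:
    """Per-channel gradient magnitude of a (B, C, H, W) image."""
    c = img.shape[1]
    kx = _SOBEL_X.to(img).view(1, 1, 3, 3).repeat(c, 1, 1, 1)
    ky = _SOBEL_Y.to(img).view(1, 1, 3, 3).repeat(c, 1, 1, 1)
    gx = F.conv2d(img, kx, padding=1, groups=c)
    gy = F.conv2d(img, ky, padding=1, groups=c)
    return torch.sqrt(gx ** 2 + gy ** 2 + 1e-8)


def detail_aware_loss(pred: torch.Tensor,
                      target: torch.Tensor,
                      mask: torch.Tensor,
                      lambda_hf: float = 1.0) -> torch.Tensor:
    """Base L1 reconstruction plus an L1 penalty on high-frequency maps.

    `mask` is a (B, 1, H, W) map that is 1 inside the inpainted product
    region, so the high-frequency term concentrates supervision on product
    details; `lambda_hf` is an illustrative weight, not the paper's value.
    """
    recon = F.l1_loss(pred, target)
    hf = F.l1_loss(high_frequency_map(pred) * mask,
                   high_frequency_map(target) * mask)
    return recon + lambda_hf * hf
```

Whether the real DAL uses Sobel, Laplacian, or wavelet maps, and whether it is masked or global, is left open by the abstract and flagged in the referee report.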

If this is right

  • The generated images retain higher visual fidelity for use in advertising and e-commerce.
  • The framework outperforms prior reference-based inpainting models on metrics of detail preservation.
  • The HP-Image-40K dataset supports more effective training for product-specific image synthesis.
  • High-frequency map supervision leads to better retention of small-scale product features.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The automatic curation pipeline could be adapted to build specialized datasets for other narrow-domain generation tasks.
  • The approach may combine with diffusion models to further increase realism while keeping reference guidance.
  • Similar attention and loss designs could apply to detail-critical editing in medical or technical imagery.

Load-bearing premise

The assumption that the new attention module and loss, trained on the HP-Image-40K dataset, will produce generalizable improvements in product detail fidelity beyond the specific training distribution.

What would settle it

A set of test images from a different product domain or real-world capture where the model visibly distorts or loses fine textures and logos would show the improvements do not generalize.

read the original abstract

Human-product images, which showcase the integration of humans and products, play a vital role in advertising, e-commerce, and digital marketing. The essential challenge of generating such images lies in ensuring the high-fidelity preservation of product details. Among existing paradigms, reference-based inpainting offers a targeted solution by leveraging product reference images to guide the inpainting process. However, limitations remain in three key aspects: the lack of diverse large-scale training data, the struggle of current models to focus on product detail preservation, and the inability of coarse supervision for achieving precise guidance. To address these issues, we propose HiFi-Inpaint, a novel high-fidelity reference-based inpainting framework tailored for generating human-product images. HiFi-Inpaint introduces Shared Enhancement Attention (SEA) to refine fine-grained product features and Detail-Aware Loss (DAL) to enforce precise pixel-level supervision using high-frequency maps. Additionally, we construct a new dataset, HP-Image-40K, with samples curated from self-synthesis data and processed with automatic filtering. Experimental results show that HiFi-Inpaint achieves state-of-the-art performance, delivering detail-preserving human-product images.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes HiFi-Inpaint, a reference-based inpainting framework for high-fidelity human-product image generation. It introduces Shared Enhancement Attention (SEA) to refine fine-grained product features, Detail-Aware Loss (DAL) for pixel-level supervision via high-frequency maps, and the automatically curated HP-Image-40K dataset. The central claim is that these components enable state-of-the-art detail-preserving results in human-product composites for applications such as e-commerce.

Significance. If the experimental claims hold with rigorous validation, the work could advance practical inpainting for advertising and retail imagery by improving product detail fidelity. The dataset contribution and the two new modules (SEA and DAL) represent targeted engineering advances, though their generalizability beyond the training distribution remains to be demonstrated.

major comments (3)
  1. [Experimental Results] Experimental Results section: The abstract asserts SOTA performance, yet the provided text supplies no quantitative metrics (e.g., PSNR, SSIM, LPIPS, FID), baseline comparisons, ablation studies, or error analysis. Without these, the data-to-claim link for the superiority of SEA + DAL + HP-Image-40K cannot be evaluated.
  2. [Method] Method section (description of DAL): The Detail-Aware Loss is defined using high-frequency maps, but no derivation or ablation shows that this formulation is necessary versus standard perceptual or reconstruction losses; the load-bearing claim that it enforces 'precise pixel-level supervision' requires explicit quantitative support.
  3. [Dataset] Dataset section: The HP-Image-40K curation process is described as automatic filtering of self-synthesis data, but no statistics on filtering criteria, diversity metrics, or train/test split details are given, undermining claims of improved generalizability.
minor comments (2)
  1. [Method] Notation for SEA is introduced without a clear equation or diagram reference; add a formal definition or pseudocode in §3.
  2. [Figures] Figure captions for qualitative results should include the specific baselines shown and the metric values for each example.

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We agree that the current manuscript version requires strengthening in the experimental validation, justification of the Detail-Aware Loss, and dataset documentation. We will revise the paper to address each point with additional quantitative evidence, ablations, and statistics while preserving the core contributions of SEA, DAL, and the HP-Image-40K dataset.

read point-by-point responses
  1. Referee: [Experimental Results] Experimental Results section: The abstract asserts SOTA performance, yet the provided text supplies no quantitative metrics (e.g., PSNR, SSIM, LPIPS, FID), baseline comparisons, ablation studies, or error analysis. Without these, the data-to-claim link for the superiority of SEA + DAL + HP-Image-40K cannot be evaluated.

    Authors: We acknowledge the omission in the submitted version. The revised manuscript will expand the Experimental Results section to include full quantitative tables reporting PSNR, SSIM, LPIPS, and FID scores against multiple reference-based inpainting baselines. We will also present ablation studies isolating the contributions of SEA and DAL, along with error analysis on failure cases. These additions will explicitly link the metrics to the superiority claims for the combined SEA + DAL + HP-Image-40K pipeline (an illustrative region-level metric sketch follows these responses). revision: yes

  2. Referee: [Method] Method section (description of DAL): The Detail-Aware Loss is defined using high-frequency maps, but no derivation or ablation shows that this formulation is necessary versus standard perceptual or reconstruction losses; the load-bearing claim that it enforces 'precise pixel-level supervision' requires explicit quantitative support.

    Authors: We will add a short derivation subsection explaining the high-frequency map construction and its motivation for pixel-level detail preservation. The revised version will include a dedicated ablation table comparing DAL against standard L1, perceptual (VGG), and adversarial losses, reporting quantitative gains in high-frequency fidelity metrics. This will provide the requested empirical support for the claim of precise supervision. revision: yes

  3. Referee: [Dataset] Dataset section: The HP-Image-40K curation process is described as automatic filtering of self-synthesis data, but no statistics on filtering criteria, diversity metrics, or train/test split details are given, undermining claims of improved generalizability.

    Authors: We will expand the Dataset section with explicit statistics: filtering thresholds (e.g., minimum resolution, product visibility scores), diversity metrics (pose variance, product category distribution, background complexity), and the exact train/validation/test split ratios. These details will be accompanied by a brief analysis of how the curation improves coverage relative to prior datasets, directly supporting the generalizability claims (an illustrative filtering skeleton follows these responses). revision: yes
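
For responses 1 and 2 above, the promised numbers could take roughly the following shape: full-reference metrics restricted to the inpainted product region, plus a simple high-frequency fidelity score standing in for whatever detail metric the authors adopt. A hedged sketch using scikit-image and SciPy; LPIPS and FID require learned models (e.g. the lpips and torchmetrics packages) and are omitted here.

```python
import numpy as np
from scipy import ndimage
from skimage.metrics import peak_signal_noise_ratio, structural_similarity


def sobel_magnitude(img: np.ndarray) -> np.ndarray:
    """Gradient magnitude of a float image in [0, 1], averaged over channels."""
    gray = img.mean(axis=-1) if img.ndim == 3 else img
    return np.hypot(ndimage.sobel(gray, axis=0), ndimage.sobel(gray, axis=1))


def region_metrics(pred: np.ndarray, target: np.ndarray,
                   mask: np.ndarray) -> dict:
    """pred/target: (H, W, 3) floats in [0, 1]; mask: (H, W) bool product region."""
    edges_p, edges_t = sobel_magnitude(pred), sobel_magnitude(target)
    return {
        # PSNR over the masked product pixels only.
        "psnr": peak_signal_noise_ratio(target[mask], pred[mask], data_range=1.0),
        # SSIM needs spatial structure, so it is computed on the full frame;
        # cropping to the product bounding box would be a stricter variant.
        "ssim": structural_similarity(target, pred, channel_axis=-1,
                                      data_range=1.0),
        # Illustrative high-frequency fidelity: PSNR between Sobel edge maps,
        # restricted to the product region.
        "hf_psnr": peak_signal_noise_ratio(edges_t[mask], edges_p[mask],
                                           data_range=max(edges_t.max(), 1e-8)),
    }
```

The Sobel-edge PSNR here is illustrative only; the revision would still need to state the actual high-frequency metric and its parameters.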
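
For response 3, the curation pipeline is described only as automatic filtering of self-synthesised data. The skeleton below illustrates the kind of thresholds the revised Dataset section would have to report; both scoring hooks and every numeric value are hypothetical placeholders, not figures from the paper.

```python
from dataclasses import dataclass
from typing import Callable, Optional

from PIL import Image


@dataclass
class FilterConfig:
    # Illustrative thresholds only; the paper does not report these values.
    min_side: int = 768          # minimum resolution (shorter side, in pixels)
    min_visibility: float = 0.5  # product detector confidence / visible fraction
    min_aesthetic: float = 5.0   # learned aesthetic-predictor score


def keep_sample(path: str,
                product_visibility: Callable[[str], float],
                aesthetic_score: Callable[[str], float],
                cfg: Optional[FilterConfig] = None) -> bool:
    """Return True if a self-synthesised sample passes all automatic filters.

    `product_visibility` and `aesthetic_score` are hypothetical hooks (e.g. a
    detector's confidence on the product crop and an aesthetic predictor);
    the real models, thresholds, and split ratios are what the revised
    Dataset section would need to state.
    """
    cfg = cfg or FilterConfig()
    with Image.open(path) as im:
        if min(im.size) < cfg.min_side:
            return False
    if product_visibility(path) < cfg.min_visibility:
        return False
    return aesthetic_score(path) >= cfg.min_aesthetic
```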

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper proposes new architectural modules (SEA, DAL) and a curated dataset (HP-Image-40K) whose performance claims rest on experimental comparisons rather than any closed-form derivation, fitted parameter renamed as prediction, or self-citation chain. No equations are presented that reduce the claimed fidelity gains to quantities defined by the inputs themselves; the central improvements are introduced as novel and evaluated externally.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The central claim rests on the effectiveness of two newly introduced modules and an automatically filtered dataset whose construction details are not provided in the abstract.

axioms (1)
  • domain assumption Standard deep-learning training assumptions for conditional image generation hold for the reference-based inpainting task.
    The framework is presented as an extension of existing reference-based inpainting paradigms without additional justification of training stability or convergence.
invented entities (2)
  • Shared Enhancement Attention (SEA) no independent evidence
    purpose: Refine fine-grained product features during inpainting
    New attention mechanism introduced to focus on product details.
  • Detail-Aware Loss (DAL) no independent evidence
    purpose: Enforce precise pixel-level supervision using high-frequency maps
    New loss term proposed for detail preservation.

pith-pipeline@v0.9.0 · 5550 in / 1283 out tokens · 53662 ms · 2026-05-15T17:22:37.971541+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. OmniShow: Unifying Multimodal Conditions for Human-Object Interaction Video Generation

    cs.CV · 2026-04 · unverdicted · novelty 6.0

    OmniShow unifies text, image, audio, and pose conditions into an end-to-end model for high-quality human-object interaction video generation and introduces the HOIVG-Bench benchmark, claiming state-of-the-art results.

Reference graph

Works this paper leans on

61 extracted references · 61 canonical work pages · cited by 1 Pith paper · 3 internal anchors

