HiFi-Inpaint: Towards High-Fidelity Reference-Based Inpainting for Generating Detail-Preserving Human-Product Images
Pith reviewed 2026-05-15 17:22 UTC · model grok-4.3 · Recognition: 2 theorem links
The pith
HiFi-Inpaint uses shared enhancement attention and detail-aware loss to generate high-fidelity human-product images that preserve product details.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
HiFi-Inpaint is a reference-based inpainting framework that introduces Shared Enhancement Attention (SEA) to refine fine-grained product features and Detail-Aware Loss (DAL) to enforce precise pixel-level supervision using high-frequency maps. Trained on the automatically curated HP-Image-40K dataset, it achieves state-of-the-art performance in generating detail-preserving human-product images.
What carries the argument
Shared Enhancement Attention (SEA) combined with Detail-Aware Loss (DAL), which together refine product features from references and supply high-frequency pixel supervision during inpainting.
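The abstract states this mechanism only in prose. Purely as a rough illustration of what "high-frequency pixel supervision" could look like, here is a minimal PyTorch sketch of an L1 reconstruction loss re-weighted by a Sobel-derived high-frequency map; the Sobel kernels, the `alpha` weighting, and the masked normalization are assumptions for illustration, not the paper's actual DAL.

```python
import torch
import torch.nn.functional as F

def high_frequency_map(img: torch.Tensor) -> torch.Tensor:
    """Sobel gradient magnitude as a high-frequency map.

    img: (B, C, H, W) floats in [0, 1]. Returns (B, 1, H, W),
    the per-pixel gradient magnitude averaged over channels.
    """
    kx = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]],
                      device=img.device).view(1, 1, 3, 3)
    ky = kx.transpose(2, 3)  # Sobel-y is the transpose of Sobel-x
    c = img.shape[1]
    gx = F.conv2d(img, kx.expand(c, 1, 3, 3), padding=1, groups=c)
    gy = F.conv2d(img, ky.expand(c, 1, 3, 3), padding=1, groups=c)
    return torch.sqrt(gx ** 2 + gy ** 2 + 1e-8).mean(dim=1, keepdim=True)

def detail_aware_loss(pred: torch.Tensor, target: torch.Tensor,
                      mask: torch.Tensor, alpha: float = 1.0) -> torch.Tensor:
    """L1 loss re-weighted by the target's high-frequency map,
    restricted to the inpainted region. mask: (B, 1, H, W), 1 = inpainted.
    The form of the weighting is a hypothetical reconstruction."""
    hf = high_frequency_map(target)        # emphasizes edges, logos, text
    weight = (1.0 + alpha * hf) * mask     # base weight 1, boosted on detail
    return (weight * (pred - target).abs()).sum() / weight.sum().clamp_min(1e-8)
```

On a product crop, such a weight concentrates on logo edges and text strokes, exactly where a plain L1 or perceptual loss tends to average errors away.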
If this is right
- The generated images retain higher visual fidelity for use in advertising and e-commerce.
- The framework outperforms prior reference-based inpainting models on metrics of detail preservation.
- The HP-Image-40K dataset supports more effective training for product-specific image synthesis.
- High-frequency map supervision leads to better retention of small-scale product features.
Where Pith is reading between the lines
- The automatic curation pipeline could be adapted to build specialized datasets for other narrow-domain generation tasks.
- The approach may combine with diffusion models to further increase realism while keeping reference guidance.
- Similar attention and loss designs could apply to detail-critical editing in medical or technical imagery.
Load-bearing premise
The assumption that the new attention module and loss, trained on the HP-Image-40K dataset, will produce generalizable improvements in product detail fidelity beyond the specific training distribution.
What would settle it
A set of test images from a different product domain, or from real-world capture, on which the model visibly distorts or loses fine textures and logos would show that the improvements do not generalize.
Original abstract
Human-product images, which showcase the integration of humans and products, play a vital role in advertising, e-commerce, and digital marketing. The essential challenge of generating such images lies in ensuring the high-fidelity preservation of product details. Among existing paradigms, reference-based inpainting offers a targeted solution by leveraging product reference images to guide the inpainting process. However, limitations remain in three key aspects: the lack of diverse large-scale training data, the struggle of current models to focus on product detail preservation, and the inability of coarse supervision for achieving precise guidance. To address these issues, we propose HiFi-Inpaint, a novel high-fidelity reference-based inpainting framework tailored for generating human-product images. HiFi-Inpaint introduces Shared Enhancement Attention (SEA) to refine fine-grained product features and Detail-Aware Loss (DAL) to enforce precise pixel-level supervision using high-frequency maps. Additionally, we construct a new dataset, HP-Image-40K, with samples curated from self-synthesis data and processed with automatic filtering. Experimental results show that HiFi-Inpaint achieves state-of-the-art performance, delivering detail-preserving human-product images.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes HiFi-Inpaint, a reference-based inpainting framework for high-fidelity human-product image generation. It introduces Shared Enhancement Attention (SEA) to refine fine-grained product features, Detail-Aware Loss (DAL) for pixel-level supervision via high-frequency maps, and the automatically curated HP-Image-40K dataset. The central claim is that these components enable state-of-the-art detail-preserving results in human-product composites for applications such as e-commerce.
Significance. If the experimental claims hold with rigorous validation, the work could advance practical inpainting for advertising and retail imagery by improving product detail fidelity. The dataset contribution and the two new modules (SEA and DAL) represent targeted engineering advances, though their generalizability beyond the training distribution remains to be demonstrated.
major comments (3)
- [Experimental Results] Experimental Results section: The abstract asserts SOTA performance, yet the provided text supplies no quantitative metrics (e.g., PSNR, SSIM, LPIPS, FID), baseline comparisons, ablation studies, or error analysis. Without these, the data-to-claim link for the superiority of SEA + DAL + HP-Image-40K cannot be evaluated.
- [Method] Method section (description of DAL): The Detail-Aware Loss is defined using high-frequency maps, but no derivation or ablation shows that this formulation is necessary versus standard perceptual or reconstruction losses; the load-bearing claim that it enforces 'precise pixel-level supervision' requires explicit quantitative support.
- [Dataset] Dataset section: The HP-Image-40K curation process is described as automatic filtering of self-synthesis data, but no statistics on filtering criteria, diversity metrics, or train/test split details are given, undermining claims of improved generalizability.
minor comments (2)
- [Method] Notation for SEA is introduced without a clear equation or diagram reference; add a formal definition or pseudocode in §3.
- [Figures] Figure captions for qualitative results should include the specific baselines shown and the metric values for each example.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed comments. We agree that the current manuscript version requires strengthening in the experimental validation, justification of the Detail-Aware Loss, and dataset documentation. We will revise the paper to address each point with additional quantitative evidence, ablations, and statistics while preserving the core contributions of SEA, DAL, and the HP-Image-40K dataset.
Point-by-point responses
Referee: [Experimental Results] Experimental Results section: The abstract asserts SOTA performance, yet the provided text supplies no quantitative metrics (e.g., PSNR, SSIM, LPIPS, FID), baseline comparisons, ablation studies, or error analysis. Without these, the data-to-claim link for the superiority of SEA + DAL + HP-Image-40K cannot be evaluated.
Authors: We acknowledge the omission in the submitted version. The revised manuscript will expand the Experimental Results section to include full quantitative tables reporting PSNR, SSIM, LPIPS, and FID scores against multiple reference-based inpainting baselines. We will also present ablation studies isolating the contributions of SEA and DAL, along with error analysis on failure cases. These additions will explicitly link the metrics to the superiority claims for the combined SEA + DAL + HP-Image-40K pipeline.
Revision: yes
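The paper's evaluation protocol is not given in the excerpt. For concreteness, here is a minimal sketch of how those four metrics are commonly computed with `torchmetrics`; the data ranges, the `net_type` choice, and the [0, 1] input convention are assumptions, not the authors' protocol.

```python
import torch
from torchmetrics.image import (FrechetInceptionDistance,
                                LearnedPerceptualImagePatchSimilarity,
                                PeakSignalNoiseRatio,
                                StructuralSimilarityIndexMeasure)

psnr = PeakSignalNoiseRatio(data_range=1.0)
ssim = StructuralSimilarityIndexMeasure(data_range=1.0)
lpips = LearnedPerceptualImagePatchSimilarity(net_type="alex", normalize=True)
fid = FrechetInceptionDistance(normalize=True)  # accumulates over the test set

def evaluate_batch(pred: torch.Tensor, target: torch.Tensor) -> dict:
    """pred/target: (B, 3, H, W) floats in [0, 1]."""
    fid.update(target, real=True)   # FID is a dataset-level statistic,
    fid.update(pred, real=False)    # so it only accumulates here
    return {"PSNR": psnr(pred, target).item(),
            "SSIM": ssim(pred, target).item(),
            "LPIPS": lpips(pred, target).item()}

# After the last batch, fid.compute() yields the dataset-level FID.
```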
Referee: [Method] Method section (description of DAL): The Detail-Aware Loss is defined using high-frequency maps, but no derivation or ablation shows that this formulation is necessary versus standard perceptual or reconstruction losses; the load-bearing claim that it enforces 'precise pixel-level supervision' requires explicit quantitative support.
Authors: We will add a short derivation subsection explaining the high-frequency map construction and its motivation for pixel-level detail preservation. The revised version will include a dedicated ablation table comparing DAL against standard L1, perceptual (VGG), and adversarial losses, reporting quantitative gains in high-frequency fidelity metrics. This will provide the requested empirical support for the claim of precise supervision.
Revision: yes
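For reference, the "perceptual (VGG)" baseline such an ablation would compare against is typically a feature-space distance through a frozen VGG. A minimal sketch follows; the layer cut-off is a common choice (up to relu3_3), and the omitted ImageNet normalization is a simplification, not the authors' setup.

```python
import torch
from torchvision import models

class VGGPerceptualLoss(torch.nn.Module):
    """L1 distance in frozen VGG16 feature space, the usual
    'perceptual (VGG)' baseline in inpainting ablations."""
    def __init__(self):
        super().__init__()
        # features[:16] ends after relu3_3
        feats = models.vgg16(weights=models.VGG16_Weights.DEFAULT).features[:16]
        self.feats = feats.eval()
        for p in self.feats.parameters():
            p.requires_grad_(False)

    def forward(self, pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
        # Inputs assumed in [0, 1]; ImageNet mean/std normalization is
        # omitted here for brevity but needed in practice.
        return (self.feats(pred) - self.feats(target)).abs().mean()
```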
Referee: [Dataset] Dataset section: The HP-Image-40K curation process is described as automatic filtering of self-synthesis data, but no statistics on filtering criteria, diversity metrics, or train/test split details are given, undermining claims of improved generalizability.
Authors: We will expand the Dataset section with explicit statistics: filtering thresholds (e.g., minimum resolution, product visibility scores), diversity metrics (pose variance, product category distribution, background complexity), and the exact train/validation/test split ratios. These details will be accompanied by a brief analysis of how the curation improves coverage relative to prior datasets, directly supporting the generalizability claims.
Revision: yes
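None of these thresholds are reported yet. Purely as a sketch of what an automatic filter over self-synthesized samples might look like, here is one plausible shape; every number and score source below is hypothetical.

```python
from dataclasses import dataclass
from PIL import Image

@dataclass
class FilterConfig:
    min_side: int = 512          # hypothetical minimum resolution
    min_visibility: float = 0.6  # hypothetical product-visibility cutoff
    min_aesthetic: float = 5.0   # hypothetical aesthetic-predictor cutoff

def keep_sample(path: str, visibility: float, aesthetic: float,
                cfg: FilterConfig = FilterConfig()) -> bool:
    """Pass/fail decision for one self-synthesized sample.

    `visibility` and `aesthetic` are assumed to come from upstream
    models (e.g., a detector and an aesthetic scorer); the excerpt
    does not specify which models or thresholds the paper uses.
    """
    with Image.open(path) as im:
        w, h = im.size
    return (min(w, h) >= cfg.min_side
            and visibility >= cfg.min_visibility
            and aesthetic >= cfg.min_aesthetic)
```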
Circularity Check
No significant circularity
Full rationale
The paper proposes new architectural modules (SEA, DAL) and a curated dataset (HP-Image-40K) whose performance claims rest on experimental comparisons rather than any closed-form derivation, fitted parameter renamed as prediction, or self-citation chain. No equations are presented that reduce the claimed fidelity gains to quantities defined by the inputs themselves; the central improvements are introduced as novel and evaluated externally.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: standard deep-learning training assumptions for conditional image generation hold for the reference-based inpainting task.
invented entities (2)
- Shared Enhancement Attention (SEA): no independent evidence
- Detail-Aware Loss (DAL): no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · tagged unclear
The relation between the paper passage and the cited Recognition theorem is unclear.
Passage: "HiFi-Inpaint introduces Shared Enhancement Attention (SEA) to refine fine-grained product features and Detail-Aware Loss (DAL) to enforce precise pixel-level supervision using high-frequency maps."
- IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking · tagged unclear
The relation between the paper passage and the cited Recognition theorem is unclear.
Passage: "We design a high-frequency map-guided DiT framework that employs a token merging mechanism"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 1 Pith paper
- OmniShow: Unifying Multimodal Conditions for Human-Object Interaction Video Generation
OmniShow unifies text, image, audio, and pose conditions into an end-to-end model for high-quality human-object interaction video generation and introduces the HOIVG-Bench benchmark, claiming state-of-the-art results.