AHS: Adaptive Head Synthesis via Synthetic Data Augmentations
Pith reviewed 2026-05-10 08:58 UTC · model grok-4.3
The pith
Adaptive Head Synthesis (AHS) employs head-reenacted synthetic data augmentation to enable robust head swapping on full upper-body images without paired training data.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
AHS achieves superior performance in challenging real-world scenarios, producing visually coherent results that preserve identity and expression fidelity across various head orientations and hairstyles. Notably, AHS shows exceptional robustness in maintaining facial identity under drastic expression changes and faithfully preserving accessories under significant head pose variations.
Load-bearing premise
The novel head-reenacted synthetic data augmentation strategy overcomes self-supervised training constraints and enhances generalization across diverse facial expressions and orientations without requiring paired training data.
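To make the premise concrete: one plausible reading of the strategy is that each unpaired image supplies its own supervision, with the subject's head reenacted to a new pose and expression and the model trained to put it back onto the original body, so the untouched image serves as ground truth. The sketch below is a minimal illustration under that assumption; the `reenact_head`-style helpers, the mask arithmetic, and the L1 objective are placeholders, not the paper's actual pipeline or losses.

```python
# Hypothetical sketch of head-reenacted synthetic pair construction for
# self-supervised head swapping. None of these helpers come from the paper;
# they stand in for an off-the-shelf reenactor, a human parser, and the
# swapping model under training.
import torch

def make_training_pair(image, reenactor, parser, pose_sampler):
    """Build an (input, target) pair from a single unpaired image."""
    head_mask = parser.head_mask(image)      # segment head + hair region
    driving = pose_sampler.sample()          # random target pose/expression
    # Reenact the subject's own head: appearance stays, geometry changes.
    synthetic_head = reenactor(image, head_mask, driving)
    # Model input: body with the original head removed, plus the reenacted
    # head as the "source" to swap in. Target: the untouched original image.
    body_only = image * (1 - head_mask)
    return (body_only, synthetic_head), image

def training_step(model, batch, optimizer):
    (body, src_head), target = batch
    pred = model(body, src_head)             # swapped result
    loss = torch.nn.functional.l1_loss(pred, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Because the reenacted head differs geometrically from the original, reconstructing the target forces the model to handle pose and expression mismatch without ever seeing a real paired example.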
original abstract
Recent digital media advancements have created increasing demands for sophisticated portrait manipulation techniques, particularly head swapping, where one's head is seamlessly integrated with another's body. However, current approaches predominantly rely on face-centered cropped data with limited view angles, significantly restricting their real-world applicability. They struggle with diverse head expressions, varying hairstyles, and natural blending beyond facial regions. To address these limitations, we propose Adaptive Head Synthesis (AHS), which effectively handles full upper-body images with varied head poses and expressions. AHS incorporates a novel head-reenacted synthetic data augmentation strategy to overcome self-supervised training constraints, enhancing generalization across diverse facial expressions and orientations without requiring paired training data. Comprehensive experiments demonstrate that AHS achieves superior performance in challenging real-world scenarios, producing visually coherent results that preserve identity and expression fidelity across various head orientations and hairstyles. Notably, AHS shows exceptional robustness in maintaining facial identity under drastic expression changes and faithfully preserving accessories under significant head pose variations.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes Adaptive Head Synthesis (AHS) for head swapping on full upper-body images. It introduces a novel head-reenacted synthetic data augmentation strategy to enable self-supervised training without paired data, claiming improved generalization to diverse poses, expressions, hairstyles, and better preservation of identity and accessories compared to prior face-centered methods.
Significance. If the central claims hold, the work would advance portrait manipulation by reducing dependence on paired training data and extending applicability beyond cropped faces. The synthetic augmentation approach is a potential strength for overcoming self-supervised limitations, provided the domain gap to real images is demonstrably small.
major comments (2)
- [Abstract] The central claims of 'superior performance' and 'exceptional robustness' in real-world scenarios (identity preservation under drastic expression/pose changes, accessory fidelity) are asserted without any quantitative metrics, baseline comparisons, ablation results, or failure-case analysis. This absence is load-bearing because the abstract provides no evidence that the synthetic augmentation produces a distribution close enough to real full-upper-body images for the claimed generalization to follow.
- [Method] Method section (head reenactment pipeline): The novel synthetic data augmentation is presented as overcoming self-supervised constraints and enhancing generalization without paired data. However, no evidence is given that the reenactment mechanism adequately models real-world variations in hair dynamics, accessory occlusion, or lighting interactions; without such validation or filtering of reenactment failures, the self-supervised objective may optimize for an easier synthetic distribution, undermining the real-world robustness claim.
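The domain-gap objection above is directly testable: compute FID [21] between real head crops and their head-reenacted synthetic counterparts, using the same metric the paper already reports for evaluation. A minimal sketch follows; the metric comes from torchmetrics, while the dataloaders and the `crop_heads` helper are assumptions for illustration, not anything the paper specifies.

```python
# Sketch of the domain-gap check the referee asks for: FID between real
# head crops and head-reenacted synthetic crops. The loaders and the
# crop_heads() helper are hypothetical; torchmetrics supplies the metric,
# which by default expects uint8 images shaped (N, 3, H, W).
import torch
from torchmetrics.image.fid import FrechetInceptionDistance

def synthetic_real_fid(real_loader, synthetic_loader, crop_heads):
    fid = FrechetInceptionDistance(feature=2048)
    for batch in real_loader:
        fid.update(crop_heads(batch).to(torch.uint8), real=True)
    for batch in synthetic_loader:
        fid.update(crop_heads(batch).to(torch.uint8), real=False)
    return fid.compute()  # lower = smaller synthetic-to-real gap
```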
minor comments (1)
- [Abstract] The abstract could be strengthened by including at least one key quantitative result or baseline name to allow readers to gauge the magnitude of improvement.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address each major comment below, providing clarifications based on the manuscript content and indicating revisions where the presentation can be strengthened without misrepresenting our results.
point-by-point responses
- Referee: [Abstract] The central claims of 'superior performance' and 'exceptional robustness' in real-world scenarios (identity preservation under drastic expression/pose changes, accessory fidelity) are asserted without any quantitative metrics, baseline comparisons, ablation results, or failure-case analysis. This absence is load-bearing because the abstract provides no evidence that the synthetic augmentation produces a distribution close enough to real full-upper-body images for the claimed generalization to follow.
  Authors: The abstract serves as a high-level summary of findings detailed in the Experiments section, which includes quantitative metrics, baseline comparisons, ablation studies, and visual results demonstrating generalization. We agree the abstract could better signal this support. We have revised it to moderate phrasing (e.g., 'improved performance' and 'robustness') and added a brief reference to key experimental outcomes showing the synthetic data's effectiveness in bridging to real images. Revision: yes.
- Referee: [Method] Method section (head reenactment pipeline): The novel synthetic data augmentation is presented as overcoming self-supervised constraints and enhancing generalization without paired data. However, no evidence is given that the reenactment mechanism adequately models real-world variations in hair dynamics, accessory occlusion, or lighting interactions; without such validation or filtering of reenactment failures, the self-supervised objective may optimize for an easier synthetic distribution, undermining the real-world robustness claim.
  Authors: The reenactment pipeline leverages established techniques, with overall effectiveness validated indirectly through superior real-image results in experiments. We acknowledge the value of more direct analysis on hair, occlusion, and lighting. We have added discussion in the revised Method section and supplementary material on domain gap mitigation, including example reenactment outputs and a failure-case analysis to show where variations are handled or limited. Revision: yes.
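The filtering step this response alludes to could be realized as a simple quality gate: keep a reenacted sample only if identity similarity stays high (e.g., ArcFace-style embeddings [15]) and the achieved head pose matches the driving target (e.g., a Hopenet-style estimator [47]). A hedged sketch under those assumptions; the embedder, the pose estimator, and both thresholds are illustrative stand-ins, not values from the paper.

```python
# Hypothetical quality gate for reenacted training samples. id_embed and
# estimate_pose stand in for an ArcFace-style face embedder [15] and a
# head-pose estimator [47]; the thresholds are illustrative, not tuned.
import torch
import torch.nn.functional as F

def keep_sample(src_img, reenacted_img, target_pose,
                id_embed, estimate_pose,
                id_thresh=0.4, pose_thresh_deg=15.0):
    # Identity must survive reenactment: cosine similarity of embeddings.
    sim = F.cosine_similarity(
        id_embed(src_img), id_embed(reenacted_img), dim=-1)
    # Reenactment must actually reach the driving pose (yaw/pitch/roll).
    pose_err = (estimate_pose(reenacted_img) - target_pose).abs().max()
    return bool(sim.item() > id_thresh and pose_err.item() < pose_thresh_deg)
```

Samples that fail the gate would be dropped before training, preventing the self-supervised objective from fitting to easy-but-broken reenactments.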
Circularity Check
No significant circularity; empirical method proposal with no derivation chain
full rationale
The paper proposes Adaptive Head Synthesis (AHS) as a new technique using novel head-reenacted synthetic data augmentation for full upper-body head swapping. No equations, first-principles derivations, or predictions are present that could reduce to inputs by construction. The approach is described as an independent methodological contribution to overcome self-supervised constraints without paired data, supported by experiments rather than any fitted-parameter renaming, self-definitional loops, or load-bearing self-citations. The central claims rest on the augmentation strategy's design and empirical results, which are self-contained and do not invoke uniqueness theorems or ansatzes from prior author work.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
- [1] Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
- [2] Sanoojan Baliah, Qinliang Lin, Shengcai Liao, Xiaodan Liang, and Muhammad Haris Khan. Realistic and efficient face swapping: A unified approach with diffusion models. arXiv preprint arXiv:2409.07269, 2024.
- [3] Tim Brooks, Aleksander Holynski, and Alexei A Efros. InstructPix2Pix: Learning to follow image editing instructions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18392–18402, 2023.
- [4] Soravit Changpinyo, Piyush Sharma, Nan Ding, and Radu Soricut. Conceptual 12M: Pushing web-scale image-text pre-training to recognize long-tail visual concepts. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3558–3568, 2021.
- [5] Junsong Chen, Jincheng Yu, Chongjian Ge, Lewei Yao, Enze Xie, Yue Wu, Zhongdao Wang, James Kwok, Ping Luo, Huchuan Lu, et al. PixArt-α: Fast training of diffusion transformer for photorealistic text-to-image synthesis. arXiv preprint arXiv:2310.00426, 2023.
- [6] Renwang Chen, Xuanhong Chen, Bingbing Ni, and Yanhao Ge. SimSwap: An efficient framework for high fidelity face swapping. In Proceedings of the 28th ACM International Conference on Multimedia, pages 2003–2011, 2020.
- [7] Xi Chen, Lianghua Huang, Yu Liu, Yujun Shen, Deli Zhao, and Hengshuang Zhao. AnyDoor: Zero-shot object-level image customization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6593–6602, 2024.
- [8] Seunghwan Choi, Sunghyun Park, Minsoo Lee, and Jaegul Choo. VITON-HD: High-resolution virtual try-on via misalignment-aware normalization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14131–14140, 2021.
- [9] Yisol Choi, Sangkyung Kwak, Kyungmin Lee, Hyungwon Choi, and Jinwoo Shin. Improving diffusion models for authentic virtual try-on in the wild. In European Conference on Computer Vision, pages 206–235. Springer, 2024.
- [10] François Chollet. Xception: Deep learning with depthwise separable convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1251–1258, 2017.
- [11] Xuangeng Chu and Tatsuya Harada. Generalizable and animatable gaussian head avatar. Advances in Neural Information Processing Systems, 37:57642–57670, 2025.
- [12] Chaeyeon Chung, Sunghyun Park, Jeongho Kim, and Jaegul Choo. What to preserve and what to transfer: Faithful, identity-preserving diffusion-based hairstyle transfer. arXiv preprint arXiv:2408.16450, 2024.
- [13] Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. 2025.
- [14] Radek Daněček, Michael J Black, and Timo Bolkart. EMOCA: Emotion driven monocular face capture and animation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20311–20322, 2022.
- [15] Jiankang Deng, Jia Guo, Niannan Xue, and Stefanos Zafeiriou. ArcFace: Additive angular margin loss for deep face recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4690–4699, 2019.
- [16] Jianglin Fu, Shikai Li, Yuming Jiang, Kwan-Yee Lin, Chen Qian, Chen-Change Loy, Wayne Wu, and Ziwei Liu. StyleGAN-Human: A data-centric odyssey of human generation. arXiv preprint arXiv:2204.11823, 2022.
- [17] Gege Gao, Huaibo Huang, Chaoyou Fu, Zhaoyang Li, and Ran He. Information bottleneck disentanglement for identity swapping. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3404–3413, 2021.
- [18] Alexander Groshev, Anastasiia Iashchenko, Pavel Paramonov, Denis Dimitrov, and Andrey Kuznetsov. GHOST 2.0: Generative high-fidelity one shot transfer of heads. arXiv preprint arXiv:2502.18417, 2025.
- [19] Rıza Alp Güler, Natalia Neverova, and Iasonas Kokkinos. DensePose: Dense human pose estimation in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7297–7306, 2018.
- [20] Yue Han, Jiangning Zhang, Junwei Zhu, Xiangtai Li, Yanhao Ge, Wei Li, Chengjie Wang, Yong Liu, Xiaoming Liu, and Ying Tai. A generalist FaceX via learning unified facial representation. arXiv preprint arXiv:2401.00551, 2023.
- [21] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In Advances in Neural Information Processing Systems. Curran Associates, Inc., 2017.
- [22] Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598, 2022.
- [23] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840–6851, 2020.
- [24] Li Hu. Animate Anyone: Consistent and controllable image-to-video synthesis for character animation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8153–8163, 2024.
- [25] Taewoong Kang, Sohyun Jeong, Hyojin Jang, and Jaegul Choo. Zero-shot head swapping in real-world scenarios. In Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR), pages 10805–10814, 2025.
- [26] Bahjat Kawar, Shiran Zada, Oran Lang, Omer Tov, Huiwen Chang, Tali Dekel, Inbar Mosseri, and Michal Irani. Imagic: Text-based real image editing with diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6007–6017, 2023.
- [27] Rawal Khirodkar, Timur Bagautdinov, Julieta Martinez, Su Zhaoen, Austin James, Peter Selednik, Stuart Anderson, and Shunsuke Saito. Sapiens: Foundation for human vision models. In European Conference on Computer Vision, pages 206–228. Springer, 2024.
- [28] Jeongho Kim, Guojung Gu, Minho Park, Sunghyun Park, and Jaegul Choo. StableVITON: Learning semantic correspondence with latent diffusion model for virtual try-on. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8176–8185, 2024.
- [29] Jeongho Kim, Min-Jung Kim, Junsoo Lee, and Jaegul Choo. TCAN: Animating human images with temporally consistent pose guidance using diffusion models. In European Conference on Computer Vision, pages 326–342. Springer, 2024.
- [30] Kangyeol Kim, Sunghyun Park, Junsoo Lee, and Jaegul Choo. Reference-based image composition with sketch via structure-aware diffusion model. arXiv preprint arXiv:2304.09748, 2023.
- [31] Jaeseong Lee, Junha Hyung, Sohyun Jung, and Jaegul Choo. SelfSwapper: Self-supervised face swapping via shape agnostic masked autoencoder. In European Conference on Computer Vision, pages 383–400. Springer, 2024.
- [32] Peike Li, Yunqiu Xu, Yunchao Wei, and Yi Yang. Self-correction for human parsing. 2019.
- [33] Tianye Li, Timo Bolkart, Michael J Black, Hao Li, and Javier Romero. Learning a model of facial shape and expression from 4D scans. ACM Trans. Graph., 36(6):194:1–194:17, 2017.
- [34] Yuheng Li, Haotian Liu, Qingyang Wu, Fangzhou Mu, Jianwei Yang, Jianfeng Gao, Chunyuan Li, and Yong Jae Lee. GLIGEN: Open-set grounded text-to-image generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22511–22521, 2023.
- [35] Zhen Li, Mingdeng Cao, Xintao Wang, Zhongang Qi, Ming-Ming Cheng, and Ying Shan. PhotoMaker: Customizing realistic human photos via stacked ID embedding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8640–8650, 2024.
- [36] Shanchuan Lin, Bingchen Liu, Jiashi Li, and Xiao Yang. Common diffusion noise schedules and sample steps are flawed. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 5404–5411, 2024.
- [37] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In International Conference on Learning Representations, 2019.
- [38] Chong Mou, Xintao Wang, Liangbin Xie, Yanze Wu, Jian Zhang, Zhongang Qi, and Ying Shan. T2I-Adapter: Learning adapters to dig out more controllable ability for text-to-image diffusion models. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 4296–4304, 2024.
- [39] William Peebles and Saining Xie. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4195–4205, 2023.
- [40] Ivan Perov, Daiheng Gao, Nikolay Chervoniy, Kunlin Liu, Sugasa Marangonda, Chris Umé, Dpfks, Carl Shift Facenheim, Luis RP, Jian Jiang, Sheng Zhang, Pingyu Wu, Bo Zhou, and Weiming Zhang. DeepFaceLab: Integrated, flexible and extensible face-swapping framework. 2021.
- [41] Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. SDXL: Improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:2307.01952, 2023.
- [42] Malte Prinzler, Egor Zakharov, Vanessa Sklyarova, Berna Kabadayi, and Justus Thies. Joker: Conditional 3D head synthesis with extreme facial expressions. arXiv preprint arXiv:2410.16395, 2024.
- [43] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021.
- [44] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with CLIP latents. arXiv preprint arXiv:2204.06125, 2022.
- [46] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10684–10695, 2022.
- [47] Nataniel Ruiz, Eunji Chong, and James M. Rehg. Fine-grained head pose estimation without keypoints. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2018.
- [48] Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, et al. Photorealistic text-to-image diffusion models with deep language understanding. Advances in Neural Information Processing Systems, 35:36479–36494, 2022.
- [49] Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, et al. LAION-5B: An open large-scale dataset for training next generation image-text models. Advances in Neural Information Processing Systems, 35:25278–25294, 2022.
- [50] Changyong Shu, Hemao Wu, Hang Zhou, Jiaming Liu, Zhibin Hong, Changxing Ding, Junyu Han, Jingtuo Liu, Errui Ding, and Jingdong Wang. Few-shot head swapping in the wild. 2022.
- [51] Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502, 2020.
- [52] Stability AI and Hugging Face. Stable Diffusion XL 1.0 Inpainting 0.1. https://huggingface.co/diffusers/stable-diffusion-xl-1.0-inpainting-0.1, 2023. Accessed: 2025-07-28.
- [53] Qinghe Wang, Lijie Liu, Miao Hua, Pengfei Zhu, Wangmeng Zuo, Qinghua Hu, Huchuan Lu, and Bing Cao. HS-Diffusion: Semantic-mixing diffusion for head swapping. 2023.
- [54] Qixun Wang, Xu Bai, Haofan Wang, Zekui Qin, Anthony Chen, Huaxia Li, Xu Tang, and Yao Hu. InstantID: Zero-shot identity-preserving generation in seconds. arXiv preprint arXiv:2401.07519, 2024.
- [55] Chenfei Wu, Jiahao Li, Jingren Zhou, Junyang Lin, Kaiyuan Gao, Kun Yan, Shengming Yin, Shuai Bai, Xiao Xu, Yilei Chen, Yuxiang Chen, Zecheng Tang, Zekai Zhang, Zhengyi Wang, An Yang, Bowen Yu, Chen Cheng, Dayiheng Liu, Deqing Li, Hang Zhang, Hao Meng, Hu Wei, Jingyuan Ni, Kai Chen, Kuan Cao, Liang Peng, Lin Qu, Minggang Wu, Peng Wang, Shuting Yu, et al. Qwen-Image technical report. 2025.
- [56] Binxin Yang, Shuyang Gu, Bo Zhang, Ting Zhang, Xuejin Chen, Xiaoyan Sun, Dong Chen, and Fang Wen. Paint by Example: Exemplar-based image editing with diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18381–18391, 2023.
- [57] Hu Ye, Jun Zhang, Sibo Liu, Xiao Han, and Wei Yang. IP-Adapter: Text compatible image prompt adapter for text-to-image diffusion models. arXiv preprint arXiv:2308.06721, 2023.
- [58] Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3836–3847, 2023.
- [59] Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Scaling in-the-wild training for diffusion-based illumination harmonization and editing by imposing consistent light transport. In The Thirteenth International Conference on Learning Representations, 2025.
- [60] Luyang Zhu, Dawei Yang, Tyler Zhu, Fitsum Reda, William Chan, Chitwan Saharia, Mohammad Norouzi, and Ira Kemelmacher-Shlizerman. TryOnDiffusion: A tale of two UNets. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4606–4615, 2023.
- [61] Additional Related Work (6.1 Head Swap): "Many existing methods, such as those proposed in [2, 20, 40], optimize their approaches based on these cropped datasets, leading to inherent limitations in handling cases where the full head or surrounding region should be harmonized with the body. Consequently, these methods struggle with occlusions, head orient..."
- [62] Implementation Details: "Our model is composed of three key components: the H-Net, which utilizes an SDXL inpainting model [52]; the S-Net, which employs the UNet from the original SDXL [41]; and a pretrained IP-Adapter [57] and a pretrained face encoder from PhotoMaker [35]. We train our model on the SHHQ dataset [16], adopting the data handling procedures from HID [25] with modified captions as detailed in Sec..." (a loading sketch for the off-the-shelf components follows this list)
- [63] Additional Experiments (8.1 Comparisons with Additional Baselines): "We further compare our AHS with four additional baselines: Nano Banana [13], Qwen-Image-Edit [55], HeSer [50], and Ghost 2.0 [18]. As shown in Fig. 10 and Tab. 3, while Nano Banana and Qwen-Image-Edit prioritize consistency, they often produce images identical to the input or suffer from se..."
- [64] Failure Cases: "Despite robust normal estimation in profile views (Fig. 14), our method faces three main challenges: (1) identity preservation under extreme poses, (2) restoration of masked-out facial occlusions, and (3) maintaining consistent facial scales when aligning with the body geometry. These cases arise from the inherent difficulty of hallucinati..."
- [65] Additional Qualitative Results: "We provide additional qualitative results in Fig. 15 and Fig. 16 generated by our proposed approach. Figure 12: Results of IC-light Augmentation. Figure 13: Inference mask results. Figure 14: Failure Cases. Figure 15: Qualitative comparison. The images in the Head column are combined with those in the Body column. The last four colum..."
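As flagged in entry [62], the off-the-shelf pieces named there (the SDXL inpainting backbone [52] and an IP-Adapter [57]) can be assembled with the diffusers library roughly as below. This is only the standard loading path, a sketch under stated assumptions; the paper's H-Net/S-Net wiring and the PhotoMaker face encoder [35] are not reproduced here, and the adapter scale is an arbitrary illustrative value.

```python
# Sketch of loading the backbone pieces named in [62] with diffusers:
# the SDXL 1.0 inpainting model [52] plus an IP-Adapter [57]. Shows the
# off-the-shelf loading path only, not the paper's H-Net/S-Net design.
import torch
from diffusers import AutoPipelineForInpainting

pipe = AutoPipelineForInpainting.from_pretrained(
    "diffusers/stable-diffusion-xl-1.0-inpainting-0.1",  # repo from [52]
    torch_dtype=torch.float16,
).to("cuda")

# Attach an image-prompt adapter so a reference head image can condition
# the inpainting of the masked-out head region.
pipe.load_ip_adapter(
    "h94/IP-Adapter", subfolder="sdxl_models",
    weight_name="ip-adapter_sdxl.bin",
)
pipe.set_ip_adapter_scale(0.7)  # illustrative conditioning strength

# Example call (body_image, head_mask, reference_head are user-supplied):
# result = pipe(prompt="", image=body_image, mask_image=head_mask,
#               ip_adapter_image=reference_head).images[0]
```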