AHS: Adaptive Head Synthesis via Synthetic Data Augmentations
Pith reviewed 2026-05-10 08:58 UTC · model grok-4.3
The pith
Adaptive Head Synthesis (AHS) employs head-reenacted synthetic data augmentation to enable robust head swapping on full upper-body images without paired training data.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
AHS achieves superior performance in challenging real-world scenarios, producing visually coherent results that preserve identity and expression fidelity across various head orientations and hairstyles. Notably, AHS shows exceptional robustness in maintaining facial identity under drastic expression changes and faithfully preserving accessories under significant head pose variations.
Load-bearing premise
The novel head-reenacted synthetic data augmentation strategy overcomes self-supervised training constraints and enhances generalization across diverse facial expressions and orientations without requiring paired training data.
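To make the premise concrete: one plausible reading of the strategy is that each unpaired image supplies its own supervision, with the subject's head reenacted to a new pose and expression and the model trained to put it back onto the original body, so the untouched image serves as ground truth. The sketch below is a minimal illustration under that assumption; the `reenact_head`-style helpers, the mask arithmetic, and the L1 objective are placeholders, not the paper's actual pipeline or losses.

```python
# Hypothetical sketch of head-reenacted synthetic pair construction for
# self-supervised head swapping. None of these helpers come from the paper;
# they stand in for an off-the-shelf reenactor, a human parser, and the
# swapping model under training.
import torch

def make_training_pair(image, reenactor, parser, pose_sampler):
    """Build an (input, target) pair from a single unpaired image."""
    head_mask = parser.head_mask(image)      # segment head + hair region
    driving = pose_sampler.sample()          # random target pose/expression
    # Reenact the subject's own head: appearance stays, geometry changes.
    synthetic_head = reenactor(image, head_mask, driving)
    # Model input: body with the original head removed, plus the reenacted
    # head as the "source" to swap in. Target: the untouched original image.
    body_only = image * (1 - head_mask)
    return (body_only, synthetic_head), image

def training_step(model, batch, optimizer):
    (body, src_head), target = batch
    pred = model(body, src_head)             # swapped result
    loss = torch.nn.functional.l1_loss(pred, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Because the reenacted head differs geometrically from the original, reconstructing the target forces the model to handle pose and expression mismatch without ever seeing a real paired example.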
original abstract
Recent digital media advancements have created increasing demands for sophisticated portrait manipulation techniques, particularly head swapping, where one's head is seamlessly integrated with another's body. However, current approaches predominantly rely on face-centered cropped data with limited view angles, significantly restricting their real-world applicability. They struggle with diverse head expressions, varying hairstyles, and natural blending beyond facial regions. To address these limitations, we propose Adaptive Head Synthesis (AHS), which effectively handles full upper-body images with varied head poses and expressions. AHS incorporates a novel head-reenacted synthetic data augmentation strategy to overcome self-supervised training constraints, enhancing generalization across diverse facial expressions and orientations without requiring paired training data. Comprehensive experiments demonstrate that AHS achieves superior performance in challenging real-world scenarios, producing visually coherent results that preserve identity and expression fidelity across various head orientations and hairstyles. Notably, AHS shows exceptional robustness in maintaining facial identity under drastic expression changes and faithfully preserving accessories under significant head pose variations.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes Adaptive Head Synthesis (AHS) for head swapping on full upper-body images. It introduces a novel head-reenacted synthetic data augmentation strategy to enable self-supervised training without paired data, claiming improved generalization to diverse poses, expressions, hairstyles, and better preservation of identity and accessories compared to prior face-centered methods.
Significance. If the central claims hold, the work would advance portrait manipulation by reducing dependence on paired training data and extending applicability beyond cropped faces. The synthetic augmentation approach is a potential strength for overcoming self-supervised limitations, provided the domain gap to real images is demonstrably small.
major comments (2)
- [Abstract] The central claims of 'superior performance' and 'exceptional robustness' in real-world scenarios (identity preservation under drastic expression/pose changes, accessory fidelity) are asserted without any quantitative metrics, baseline comparisons, ablation results, or failure-case analysis. This absence is load-bearing because the abstract provides no evidence that the synthetic augmentation produces a distribution close enough to real full-upper-body images for the claimed generalization to follow.
- [Method] Method section (head reenactment pipeline): The novel synthetic data augmentation is presented as overcoming self-supervised constraints and enhancing generalization without paired data. However, no evidence is given that the reenactment mechanism adequately models real-world variations in hair dynamics, accessory occlusion, or lighting interactions; without such validation or filtering of reenactment failures, the self-supervised objective may optimize for an easier synthetic distribution, undermining the real-world robustness claim.
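The domain-gap objection above is directly testable: compute FID [21] between real head crops and their head-reenacted synthetic counterparts, using the same metric the paper already reports for evaluation. A minimal sketch follows; the metric comes from torchmetrics, while the dataloaders and the `crop_heads` helper are assumptions for illustration, not anything the paper specifies.

```python
# Sketch of the domain-gap check the referee asks for: FID between real
# head crops and head-reenacted synthetic crops. The loaders and the
# crop_heads() helper are hypothetical; torchmetrics supplies the metric,
# which by default expects uint8 images shaped (N, 3, H, W).
import torch
from torchmetrics.image.fid import FrechetInceptionDistance

def synthetic_real_fid(real_loader, synthetic_loader, crop_heads):
    fid = FrechetInceptionDistance(feature=2048)
    for batch in real_loader:
        fid.update(crop_heads(batch).to(torch.uint8), real=True)
    for batch in synthetic_loader:
        fid.update(crop_heads(batch).to(torch.uint8), real=False)
    return fid.compute()  # lower = smaller synthetic-to-real gap
```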
minor comments (1)
- [Abstract] The abstract could be strengthened by including at least one key quantitative result or baseline name to allow readers to gauge the magnitude of improvement.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address each major comment below, providing clarifications based on the manuscript content and indicating revisions where the presentation can be strengthened without misrepresenting our results.
point-by-point responses
- Referee: [Abstract] The central claims of 'superior performance' and 'exceptional robustness' in real-world scenarios (identity preservation under drastic expression/pose changes, accessory fidelity) are asserted without any quantitative metrics, baseline comparisons, ablation results, or failure-case analysis. This absence is load-bearing because the abstract provides no evidence that the synthetic augmentation produces a distribution close enough to real full-upper-body images for the claimed generalization to follow.
  Authors: The abstract serves as a high-level summary of findings detailed in the Experiments section, which includes quantitative metrics, baseline comparisons, ablation studies, and visual results demonstrating generalization. We agree the abstract could better signal this support. We have revised it to moderate phrasing (e.g., 'improved performance' and 'robustness') and added a brief reference to key experimental outcomes showing the synthetic data's effectiveness in bridging to real images. Revision: yes.
- Referee: [Method] Method section (head reenactment pipeline): The novel synthetic data augmentation is presented as overcoming self-supervised constraints and enhancing generalization without paired data. However, no evidence is given that the reenactment mechanism adequately models real-world variations in hair dynamics, accessory occlusion, or lighting interactions; without such validation or filtering of reenactment failures, the self-supervised objective may optimize for an easier synthetic distribution, undermining the real-world robustness claim.
  Authors: The reenactment pipeline leverages established techniques, with overall effectiveness validated indirectly through superior real-image results in experiments. We acknowledge the value of more direct analysis on hair, occlusion, and lighting. We have added discussion in the revised Method section and supplementary material on domain gap mitigation, including example reenactment outputs and a failure-case analysis to show where variations are handled or limited. Revision: yes.
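The filtering step this response alludes to could be realized as a simple quality gate: keep a reenacted sample only if identity similarity stays high (e.g., ArcFace-style embeddings [15]) and the achieved head pose matches the driving target (e.g., a Hopenet-style estimator [47]). A hedged sketch under those assumptions; the embedder, the pose estimator, and both thresholds are illustrative stand-ins, not values from the paper.

```python
# Hypothetical quality gate for reenacted training samples. id_embed and
# estimate_pose stand in for an ArcFace-style face embedder [15] and a
# head-pose estimator [47]; the thresholds are illustrative, not tuned.
import torch
import torch.nn.functional as F

def keep_sample(src_img, reenacted_img, target_pose,
                id_embed, estimate_pose,
                id_thresh=0.4, pose_thresh_deg=15.0):
    # Identity must survive reenactment: cosine similarity of embeddings.
    sim = F.cosine_similarity(
        id_embed(src_img), id_embed(reenacted_img), dim=-1)
    # Reenactment must actually reach the driving pose (yaw/pitch/roll).
    pose_err = (estimate_pose(reenacted_img) - target_pose).abs().max()
    return bool(sim.item() > id_thresh and pose_err.item() < pose_thresh_deg)
```

Samples that fail the gate would be dropped before training, preventing the self-supervised objective from fitting to easy-but-broken reenactments.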
Circularity Check
No significant circularity; empirical method proposal with no derivation chain
full rationale
The paper proposes Adaptive Head Synthesis (AHS) as a new technique using novel head-reenacted synthetic data augmentation for full upper-body head swapping. No equations, first-principles derivations, or predictions are present that could reduce to inputs by construction. The approach is described as an independent methodological contribution to overcome self-supervised constraints without paired data, supported by experiments rather than any fitted-parameter renaming, self-definitional loops, or load-bearing self-citations. The central claims rest on the augmentation strategy's design and empirical results, which are self-contained and do not invoke uniqueness theorems or ansatzes from prior author work.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
- [1] Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
- [2] Sanoojan Baliah, Qinliang Lin, Shengcai Liao, Xiaodan Liang, and Muhammad Haris Khan. Realistic and efficient face swapping: A unified approach with diffusion models. arXiv preprint arXiv:2409.07269, 2024.
- [3] Tim Brooks, Aleksander Holynski, and Alexei A Efros. InstructPix2Pix: Learning to follow image editing instructions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18392–18402, 2023.
- [4] Soravit Changpinyo, Piyush Sharma, Nan Ding, and Radu Soricut. Conceptual 12M: Pushing web-scale image-text pre-training to recognize long-tail visual concepts. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3558–3568, 2021.
- [5] Junsong Chen, Jincheng Yu, Chongjian Ge, Lewei Yao, Enze Xie, Yue Wu, Zhongdao Wang, James Kwok, Ping Luo, Huchuan Lu, et al. PixArt-α: Fast training of diffusion transformer for photorealistic text-to-image synthesis. arXiv preprint arXiv:2310.00426, 2023.
- [6] Renwang Chen, Xuanhong Chen, Bingbing Ni, and Yanhao Ge. SimSwap: An efficient framework for high fidelity face swapping. In Proceedings of the 28th ACM International Conference on Multimedia, pages 2003–2011, 2020.
- [7] Xi Chen, Lianghua Huang, Yu Liu, Yujun Shen, Deli Zhao, and Hengshuang Zhao. AnyDoor: Zero-shot object-level image customization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6593–6602, 2024.
- [8] Seunghwan Choi, Sunghyun Park, Minsoo Lee, and Jaegul Choo. VITON-HD: High-resolution virtual try-on via misalignment-aware normalization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14131–14140, 2021.
- [9] Yisol Choi, Sangkyung Kwak, Kyungmin Lee, Hyungwon Choi, and Jinwoo Shin. Improving diffusion models for authentic virtual try-on in the wild. In European Conference on Computer Vision, pages 206–235. Springer, 2024.
- [10] François Chollet. Xception: Deep learning with depthwise separable convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1251–1258, 2017.
- [11] Xuangeng Chu and Tatsuya Harada. Generalizable and animatable gaussian head avatar. Advances in Neural Information Processing Systems, 37:57642–57670, 2025.
- [12] Chaeyeon Chung, Sunghyun Park, Jeongho Kim, and Jaegul Choo. What to preserve and what to transfer: Faithful, identity-preserving diffusion-based hairstyle transfer. arXiv preprint arXiv:2408.16450, 2024.
- [13] Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. 2025.
- [14] Radek Daněček, Michael J Black, and Timo Bolkart. EMOCA: Emotion driven monocular face capture and animation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20311–20322, 2022.
- [15] Jiankang Deng, Jia Guo, Niannan Xue, and Stefanos Zafeiriou. ArcFace: Additive angular margin loss for deep face recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4690–4699, 2019.
- [16] Jianglin Fu, Shikai Li, Yuming Jiang, Kwan-Yee Lin, Chen Qian, Chen-Change Loy, Wayne Wu, and Ziwei Liu. StyleGAN-Human: A data-centric odyssey of human generation. arXiv preprint arXiv:2204.11823, 2022.
- [17] Gege Gao, Huaibo Huang, Chaoyou Fu, Zhaoyang Li, and Ran He. Information bottleneck disentanglement for identity swapping. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3404–3413, 2021.
- [18] Alexander Groshev, Anastasiia Iashchenko, Pavel Paramonov, Denis Dimitrov, and Andrey Kuznetsov. GHOST 2.0: Generative high-fidelity one shot transfer of heads. arXiv preprint arXiv:2502.18417, 2025.
- [19] Rıza Alp Güler, Natalia Neverova, and Iasonas Kokkinos. DensePose: Dense human pose estimation in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7297–7306, 2018.
- [20] Yue Han, Jiangning Zhang, Junwei Zhu, Xiangtai Li, Yanhao Ge, Wei Li, Chengjie Wang, Yong Liu, Xiaoming Liu, and Ying Tai. A generalist FaceX via learning unified facial representation. arXiv preprint arXiv:2401.00551, 2023.
- [21] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In Advances in Neural Information Processing Systems. Curran Associates, Inc., 2017.
- [22] Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598, 2022.
- [23] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840–6851, 2020.
- [24] Li Hu. Animate Anyone: Consistent and controllable image-to-video synthesis for character animation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8153–8163, 2024.
- [25] Taewoong Kang, Sohyun Jeong, Hyojin Jang, and Jaegul Choo. Zero-shot head swapping in real-world scenarios. In Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR), pages 10805–10814, 2025.
- [26] Bahjat Kawar, Shiran Zada, Oran Lang, Omer Tov, Huiwen Chang, Tali Dekel, Inbar Mosseri, and Michal Irani. Imagic: Text-based real image editing with diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6007–6017, 2023.
- [27] Rawal Khirodkar, Timur Bagautdinov, Julieta Martinez, Su Zhaoen, Austin James, Peter Selednik, Stuart Anderson, and Shunsuke Saito. Sapiens: Foundation for human vision models. In European Conference on Computer Vision, pages 206–228. Springer, 2024.
- [28] Jeongho Kim, Guojung Gu, Minho Park, Sunghyun Park, and Jaegul Choo. StableVITON: Learning semantic correspondence with latent diffusion model for virtual try-on. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8176–8185, 2024.
- [29] Jeongho Kim, Min-Jung Kim, Junsoo Lee, and Jaegul Choo. TCAN: Animating human images with temporally consistent pose guidance using diffusion models. In European Conference on Computer Vision, pages 326–342. Springer, 2024.
- [30] Kangyeol Kim, Sunghyun Park, Junsoo Lee, and Jaegul Choo. Reference-based image composition with sketch via structure-aware diffusion model. arXiv preprint arXiv:2304.09748, 2023.
- [31] Jaeseong Lee, Junha Hyung, Sohyun Jung, and Jaegul Choo. SelfSwapper: Self-supervised face swapping via shape agnostic masked autoencoder. In European Conference on Computer Vision, pages 383–400. Springer, 2024.
- [32] Peike Li, Yunqiu Xu, Yunchao Wei, and Yi Yang. Self-correction for human parsing. 2019.
- [33] Tianye Li, Timo Bolkart, Michael J Black, Hao Li, and Javier Romero. Learning a model of facial shape and expression from 4D scans. ACM Trans. Graph., 36(6):194:1–194:17, 2017.
- [34] Yuheng Li, Haotian Liu, Qingyang Wu, Fangzhou Mu, Jianwei Yang, Jianfeng Gao, Chunyuan Li, and Yong Jae Lee. GLIGEN: Open-set grounded text-to-image generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22511–22521, 2023.
- [35] Zhen Li, Mingdeng Cao, Xintao Wang, Zhongang Qi, Ming-Ming Cheng, and Ying Shan. PhotoMaker: Customizing realistic human photos via stacked ID embedding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8640–8650, 2024.
- [36] Shanchuan Lin, Bingchen Liu, Jiashi Li, and Xiao Yang. Common diffusion noise schedules and sample steps are flawed. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 5404–5411, 2024.
- [37] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In International Conference on Learning Representations, 2019.
- [38] Chong Mou, Xintao Wang, Liangbin Xie, Yanze Wu, Jian Zhang, Zhongang Qi, and Ying Shan. T2I-Adapter: Learning adapters to dig out more controllable ability for text-to-image diffusion models. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 4296–4304, 2024.
- [39] William Peebles and Saining Xie. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4195–4205, 2023.
- [40] Ivan Perov, Daiheng Gao, Nikolay Chervoniy, Kunlin Liu, Sugasa Marangonda, Chris Umé, Dpfks, Carl Shift Facenheim, Luis RP, Jian Jiang, Sheng Zhang, Pingyu Wu, Bo Zhou, and Weiming Zhang. DeepFaceLab: Integrated, flexible and extensible face-swapping framework. 2021.
- [41] Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. SDXL: Improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:2307.01952, 2023.
- [42] Malte Prinzler, Egor Zakharov, Vanessa Sklyarova, Berna Kabadayi, and Justus Thies. Joker: Conditional 3D head synthesis with extreme facial expressions. arXiv preprint arXiv:2410.16395, 2024.
- [43] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021.
- [44] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with CLIP latents. arXiv preprint arXiv:2204.06125, 2022.
- [46] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10684–10695, 2022.
- [47] Nataniel Ruiz, Eunji Chong, and James M. Rehg. Fine-grained head pose estimation without keypoints. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2018.
- [48] Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, et al. Photorealistic text-to-image diffusion models with deep language understanding. Advances in Neural Information Processing Systems, 35:36479–36494, 2022.
- [49] Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, et al. LAION-5B: An open large-scale dataset for training next generation image-text models. Advances in Neural Information Processing Systems, 35:25278–25294, 2022.
- [50] Changyong Shu, Hemao Wu, Hang Zhou, Jiaming Liu, Zhibin Hong, Changxing Ding, Junyu Han, Jingtuo Liu, Errui Ding, and Jingdong Wang. Few-shot head swapping in the wild. 2022.
- [51] Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502, 2020.
- [52] Stability AI and Hugging Face. Stable Diffusion XL 1.0 Inpainting 0.1. https://huggingface.co/diffusers/stable-diffusion-xl-1.0-inpainting-0.1, 2023. Accessed: 2025-07-28.
- [53] Qinghe Wang, Lijie Liu, Miao Hua, Pengfei Zhu, Wangmeng Zuo, Qinghua Hu, Huchuan Lu, and Bing Cao. HS-Diffusion: Semantic-mixing diffusion for head swapping. 2023.
- [54] Qixun Wang, Xu Bai, Haofan Wang, Zekui Qin, Anthony Chen, Huaxia Li, Xu Tang, and Yao Hu. InstantID: Zero-shot identity-preserving generation in seconds. arXiv preprint arXiv:2401.07519, 2024.
- [55] Chenfei Wu, Jiahao Li, Jingren Zhou, Junyang Lin, Kaiyuan Gao, Kun Yan, Shengming Yin, Shuai Bai, Xiao Xu, Yilei Chen, Yuxiang Chen, Zecheng Tang, Zekai Zhang, Zhengyi Wang, An Yang, Bowen Yu, Chen Cheng, Dayiheng Liu, Deqing Li, Hang Zhang, Hao Meng, Hu Wei, Jingyuan Ni, Kai Chen, Kuan Cao, Liang Peng, Lin Qu, Minggang Wu, Peng Wang, Shuting Yu, et al. Qwen-Image technical report. 2025.
- [56] Binxin Yang, Shuyang Gu, Bo Zhang, Ting Zhang, Xuejin Chen, Xiaoyan Sun, Dong Chen, and Fang Wen. Paint by Example: Exemplar-based image editing with diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18381–18391, 2023.
- [57] Hu Ye, Jun Zhang, Sibo Liu, Xiao Han, and Wei Yang. IP-Adapter: Text compatible image prompt adapter for text-to-image diffusion models. arXiv preprint arXiv:2308.06721, 2023.
- [58] Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3836–3847, 2023.
- [59] Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Scaling in-the-wild training for diffusion-based illumination harmonization and editing by imposing consistent light transport. In The Thirteenth International Conference on Learning Representations, 2025.
- [60] Luyang Zhu, Dawei Yang, Tyler Zhu, Fitsum Reda, William Chan, Chitwan Saharia, Mohammad Norouzi, and Ira Kemelmacher-Shlizerman. TryOnDiffusion: A tale of two UNets. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4606–4615, 2023.
- [61] Additional Related Work (6.1 Head Swap): "Many existing methods, such as those proposed in [2, 20, 40], optimize their approaches based on these cropped datasets, leading to inherent limitations in handling cases where the full head or surrounding region should be harmonized with the body. Consequently, these methods struggle with occlusions, head orient..."
- [62] Implementation Details: "Our model is composed of three key components: the H-Net, which utilizes an SDXL inpainting model [52]; the S-Net, which employs the UNet from the original SDXL [41]; and a pretrained IP-Adapter [57] and a pretrained face encoder from PhotoMaker [35]. We train our model on the SHHQ dataset [16], adopting the data handling procedures from HID [25] with modified captions as detailed in Sec..." (a loading sketch for the off-the-shelf components follows this list)
- [63] Additional Experiments (8.1 Comparisons with Additional Baselines): "We further compare our AHS with four additional baselines: Nano Banana [13], Qwen-Image-Edit [55], HeSer [50], and Ghost 2.0 [18]. As shown in Fig. 10 and Tab. 3, while Nano Banana and Qwen-Image-Edit prioritize consistency, they often produce images identical to the input or suffer from se..."
- [64] Failure Cases: "Despite robust normal estimation in profile views (Fig. 14), our method faces three main challenges: (1) identity preservation under extreme poses, (2) restoration of masked-out facial occlusions, and (3) maintaining consistent facial scales when aligning with the body geometry. These cases arise from the inherent difficulty of hallucinati..."
- [65] Additional Qualitative Results: "We provide additional qualitative results in Fig. 15 and Fig. 16 generated by our proposed approach. Figure 12: Results of IC-light Augmentation. Figure 13: Inference mask results. Figure 14: Failure Cases. Figure 15: Qualitative comparison. The images in the Head column are combined with those in the Body column. The last four colum..."
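As flagged in entry [62], the off-the-shelf pieces named there (the SDXL inpainting backbone [52] and an IP-Adapter [57]) can be assembled with the diffusers library roughly as below. This is only the standard loading path, a sketch under stated assumptions; the paper's H-Net/S-Net wiring and the PhotoMaker face encoder [35] are not reproduced here, and the adapter scale is an arbitrary illustrative value.

```python
# Sketch of loading the backbone pieces named in [62] with diffusers:
# the SDXL 1.0 inpainting model [52] plus an IP-Adapter [57]. Shows the
# off-the-shelf loading path only, not the paper's H-Net/S-Net design.
import torch
from diffusers import AutoPipelineForInpainting

pipe = AutoPipelineForInpainting.from_pretrained(
    "diffusers/stable-diffusion-xl-1.0-inpainting-0.1",  # repo from [52]
    torch_dtype=torch.float16,
).to("cuda")

# Attach an image-prompt adapter so a reference head image can condition
# the inpainting of the masked-out head region.
pipe.load_ip_adapter(
    "h94/IP-Adapter", subfolder="sdxl_models",
    weight_name="ip-adapter_sdxl.bin",
)
pipe.set_ip_adapter_scale(0.7)  # illustrative conditioning strength

# Example call (body_image, head_mask, reference_head are user-supplied):
# result = pipe(prompt="", image=body_image, mask_image=head_mask,
#               ip_adapter_image=reference_head).images[0]
```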