H-Adapter: Pose-Robust Hairstyle Transfer via Attention-Derived, Source-Aligned Hair Masks

Sanghun Park; Seulgi Jeong; Yunseong Cho

arxiv: 2606.25578 · v1 · pith:JFDD4OT3new · submitted 2026-06-24 · 💻 cs.CV

H-Adapter: Pose-Robust Hairstyle Transfer via Attention-Derived, Source-Aligned Hair Masks

Seulgi Jeong , Yunseong Cho , Sanghun Park This is my paper

Pith reviewed 2026-06-25 21:14 UTC · model grok-4.3

classification 💻 cs.CV

keywords hairstyle transferpose robustnessdiffusion inpaintingcross-attentionimage editingvirtual try-onmask derivation

0 comments

The pith

H-Adapter trains with a region-specific loss to produce source-aligned hair masks from disentangled cross-attention for guiding diffusion inpainting under pose differences.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces H-Adapter to perform hairstyle transfer when the source image and reference hairstyle come from different head poses. It uses a training loss that separates objectives for hair and non-hair regions, which creates cross-attention maps with clear spatial separation. These maps supply a mask aligned to the source image that directs inpainting in a diffusion model. The method reports stronger quantitative scores on pose-different test cases and extends to text prompts and color control. A VLM-based evaluation also shows gains in faithfulness and fewer artifacts.

Core claim

H-Adapter improves pose robustness by training with a region-specific loss that disentangles hair and non-hair objectives and thereby induces spatially disentangled cross-attention, from which a source-aligned hair edit mask is derived to guide diffusion-based inpainting. Experiments on pose-agnostic and pose-different subsets demonstrate strong quantitative results, including the best FID, FID_CLIP, and CLIP-I under pose differences, while maintaining competitive non-hair preservation and improving qualitative fidelity to fine-grained reference hairstyle details.

What carries the argument

region-specific loss that disentangles hair and non-hair objectives to induce spatially disentangled cross-attention maps, from which source-aligned hair edit masks are derived

If this is right

The method achieves the best FID, FID_CLIP, and CLIP-I scores on pose-different subsets.
Non-hair preservation remains competitive while reference hairstyle details are transferred more faithfully.
H-Adapter extends to text-to-image generation and auxiliary prompt-based hair color control.
It remains compatible with an identity-preserving IP-Adapter variant.
A VLM-as-a-judge protocol shows consistent gains in hairstyle faithfulness, non-hair preservation, and artifact quality.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same attention-disentanglement approach could be tested on other localized edits such as clothing or accessory transfer.
If the masks prove reliable, they might replace manual or segmentation-based masking in other diffusion editing pipelines.
Extreme pose cases beyond the paper's test subsets would provide a direct check on the mask derivation step.

Load-bearing premise

The region-specific loss will reliably produce spatially disentangled cross-attention maps whose derived masks accurately isolate hair regions even under large pose discrepancies between source and reference.

What would settle it

Observe whether the attention-derived masks correctly isolate hair on source images with large pose shifts; if the masks include non-hair regions or miss hair areas and transfer quality drops sharply, the approach does not hold.

Figures

Figures reproduced from arXiv: 2606.25578 by Sanghun Park, Seulgi Jeong, Yunseong Cho.

**Figure 1.** Figure 1: Overview of our method. We train an H-Adapter with a region-specific objective and apply a source-aligned coarse attention mask to localize diffusion inpainting for reference-guided hairstyle transfer. Our approach also leverages attention-derived masks, but targets hair transfer under pose mismatch. Unlike prior work on general edit localization, we derive a source-aligned hair mask from cross-attention … view at source ↗

**Figure 2.** Figure 2: Flexible extensions of H-Adapter. H-Adapter supports auxiliary prompt control (λ; Eq. (4)), tandard text-to-image generation, and composition with identitypreserving adapters. Source-Aligned Coarse Attention Mask A key observation is that the proposed region-specific objective yields cross-attention maps that exhibit separation between hair and non-hair regions [PITH_FULL_IMAGE:figures/full_fig_p006_2.png] view at source ↗

**Figure 3.** Figure 3: Qualitative comparisons under head-pose differences. The first two columns show the source and reference images; the remaining columns show outputs from each method. These examples span source–reference yaw gaps from 18.65◦ to 53.21◦ . the Stable Diffusion v1.5 inpainting model, where the hair-removed base images are synthesized using a FLUX.2 [1]. Full implementation details are provided in Appendix B of … view at source ↗

**Figure 4.** Figure 4: Separator-token analysis over the 16 IP-Adapter tokens. For each token, we retain the larger IoU between the mask obtained by excluding that token from attention aggregation and the corresponding token-only mask. The highest mean IoU is achieved by t8 over 3,000 samples, supporting its use as the separator token in our pipeline. non-hair regions of the source and generated images using PSNR and SSIM [39]. … view at source ↗

**Figure 5.** Figure 5: Stage-wise visualization of editing guidance in the proposed inference pipeline, comparing IP-Adapter and H-Adapter (ours). (a) Conditioning inputs: source xs, hair-removed base x˜s, and reference xr. (b) Token-wise cross-attention for 16 tokens t0, . . . , t15: (b-1) the attention map of the separator token ts, with its token index shown in parentheses; and (b-2) the aggregated attention map obtained by s… view at source ↗

**Figure 6.** Figure 6 [PITH_FULL_IMAGE:figures/full_fig_p011_6.png] view at source ↗

**Figure 7.** Figure 7: Auxiliary hair-color control with H-Adapter. Inputs and the random seed are fixed. Rows vary hair-color prompts and columns vary λ (0.2–1.0). Smaller λ improves prompt responsiveness, while larger λ strengthens reference faithfulness. As shown in Tab. 2, our method achieves the highest mean on all three axes. While the AQS gap over Style-Your-Hair falls within overlapping confidence intervals, Style-Your-… view at source ↗

**Figure 8.** Figure 8: Qualitative results of H-Adapter combined with IP-Adapter FaceID Plus [43]. The source image conditions FaceID Plus for identity preservation. Rows vary text prompts, and columns vary reference images for H-Adapter, demonstrating referenceguided hair transfer across diverse contexts while preserving identity. conditions without retraining. Additional qualitative results, including in-thewild examples, st… view at source ↗

**Figure 9.** Figure 9: Qualitative results of H-Adapter (Ours) on EasyPortrait [17] and web-crawled images. H-Adapter preserves image quality and reference hairstyle features under diverse, unconstrained conditions. HairFastGAN as a fast encoder-based GAN representative. Earlier optimizationbased GAN methods typically involve costly per-instance optimization, and are therefore not the focus of this runtime table. C Additional … view at source ↗

**Figure 10.** Figure 10: Additional qualitative results on stylized in-the-wild images. The proposed method generalizes well to illustration- and animation-like portraits. may appear plausible on its own, it may not accurately reflect how the bangs extend relative to the forehead or eyebrow region in the reference. These observations suggest that the remaining limitation lies in modeling the fine-grained spatial relationship bet… view at source ↗

**Figure 11.** Figure 11: Observed limitation on in-the-wild reference images from EasyPortrait [17]. In these cases, the generated front hair does not fully preserve the relative placement and facial coverage pattern of the reference. FaceID Plus branch: \label {eq:sup:method:ref-gated-attn-faceid} \begin {aligned} Z &= \mathrm {softmax}\!\left (\frac {QK^{\top }}{\sqrt {d}}\right )V \\ &\quad + \lambda \Bigl (M_t \odot \mathrm {… view at source ↗

**Figure 12.** Figure 12: Reference-guided text-to-image generation with H-Adapter. Given diverse text prompts and reference hairstyle images, H-Adapter generates images that reflect both the textual prompt and the key hairstyle attributes of the reference image. All results are shown with the H-Adapter conditioning scale λ = 0.6. D VLM-as-a-Judge D.1 Prompt Design Overview We evaluate each generated result along three axes: Hair … view at source ↗

**Figure 13.** Figure 13: Compatibility of H-Adapter with IP-Adapter FaceID Plus. Results are generated using a text-to-image pipeline with both H-Adapter and IP-Adapter FaceID Plus. The middle panels compare results w/ non-hair mask and w/o non-hair mask on the FaceID Plus branch across different H-Adapter conditioning scales λ. The rightmost panel visualizes the gating mask MT extracted at the final denoising step T for the exa… view at source ↗

**Figure 14.** Figure 14: Analysis of attention resolution for mask extraction. The leftmost image shows the source image overlaid with the ground-truth hair mask. The remaining columns show, from left to right, the attention heat map and the resulting mask overlaid on the source image, extracted from the 8 × 8, 16 × 16, 32 × 32, and all-resolution averaged cross-attention maps, respectively. The 16 × 16 attention maps provide an … view at source ↗

**Figure 15.** Figure 15: Quantitative comparison of attention resolutions for mask extraction. Evaluated on 3,000 randomly sampled images, the bars show the mean IoU between the attention-derived mask and the ground-truth hair mask, with error bars representing the standard deviation across samples. The 16 × 16 resolution achieves the highest mean IoU, supporting its use for coarse mask extraction. resolutions combined. For cont… view at source ↗

**Figure 16.** Figure 16: Separator-token analysis over the 16 IP-Adapter tokens. For each token, we retain the larger IoU between the mask obtained by excluding that token from attention aggregation and the corresponding token-only mask. The highest mean IoU is achieved by t8 over 3,000 samples, supporting its use as the separator token in our pipeline. construct the coarse mask using the definition above and also evaluate the to… view at source ↗

**Figure 17.** Figure 17: Analysis of separator-token behavior. We visualize the normalized value-vector norms of IP-Adapter and H-Adapter tokens. The t8 token already has a relatively small norm in the pretrained IP-Adapter, and the proposed region-specific objective further reduces its relative norm in H-Adapter. This supports the interpretation that t8 acts as a stable non-hair separator rather than a single-run artifact. G.4 C… view at source ↗

**Figure 18.** Figure 18: Comparison of attention-derived masks and generated results between our method and HairFusion [5]. The initial mask is obtained by thresholding the attentionderived map, while the refined mask denotes the post-processed mask actually used in each pipeline. Our method yields an initial mask that more selectively captures the hair region, making it better suited for spatially localizing hairstyle transfer.… view at source ↗

read the original abstract

Hairstyle transfer has practical applications such as virtual try-on, yet remains challenging when the source and reference exhibit large head-pose discrepancies. We propose H-Adapter, which improves pose robustness by training with a region-specific loss that disentangles hair and non-hair objectives and thereby induces spatially disentangled cross-attention, from which a source-aligned hair edit mask is derived to guide diffusion-based inpainting. Experiments on pose-agnostic and pose-different subsets demonstrate strong quantitative results, including the best FID, $\mathrm{FID}_{\mathrm{CLIP}}$, and CLIP-I under pose differences, while maintaining competitive non-hair preservation and improving qualitative fidelity to fine-grained reference hairstyle details. Beyond source-conditioned transfer, H-Adapter supports practical extensions including text-to-image generation, auxiliary prompt-based hair color control, and compatibility with an identity-preserving IP-Adapter variant. We also introduce a VLM-as-a-judge protocol and observe consistent gains in hairstyle faithfulness, non-hair preservation, and artifact quality.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

H-Adapter adds a region-specific loss to pull cleaner source-aligned hair masks from cross-attention in diffusion inpainting, with reported gains on pose-different test cases, but the mechanism lacks direct confirmation.

read the letter

The core contribution is a training loss that splits hair versus non-hair objectives so the model's cross-attention maps become more spatially separated, letting the authors derive a mask that stays aligned to the source even when head pose changes. They then use that mask to condition diffusion inpainting. The paper shows this pipeline beats prior numbers on FID, FID_CLIP, and CLIP-I for the pose-mismatched subset while keeping non-hair regions stable.

The quantitative edge on the hard split is the clearest positive result. The extensions to text prompts, color control, and IP-Adapter compatibility are practical and worth noting for anyone building try-on tools. The VLM-as-judge protocol is a reasonable way to get more hairstyle-specific feedback than standard metrics.

The main weakness is the missing link between the loss and the claimed disentanglement. No attention-map figures or overlap scores are described, and there is no ablation that isolates the region-specific loss from other training choices. Without those checks it is difficult to tell whether the mask quality actually comes from the new term or from dataset tuning or prompt engineering. The stress-test worry about large pose gaps therefore still stands until the paper supplies that evidence.

This is a solid incremental paper for the virtual try-on and hair-editing corner of diffusion work. Readers who need a drop-in improvement for pose robustness will get something usable; readers looking for a new theoretical angle on attention will not. The experiments are concrete enough that a serious referee should see it, even if the central claim needs tighter validation on the attention step.

Referee Report

2 major / 0 minor

Summary. The paper proposes H-Adapter for pose-robust hairstyle transfer in diffusion models. It trains with a region-specific loss to disentangle hair and non-hair objectives, inducing spatially disentangled cross-attention maps from which a source-aligned hair edit mask is derived to guide inpainting. Experiments report the best FID, FID_CLIP, and CLIP-I on pose-different subsets while maintaining competitive non-hair preservation; the method also supports text-to-image generation, prompt-based color control, and IP-Adapter compatibility, plus a VLM-as-a-judge evaluation protocol.

Significance. If the region-specific loss reliably produces accurate source-aligned masks under large pose gaps, the approach would offer a practical advance for virtual try-on and editing tasks where pose variation is common, with the reported extensions increasing its applicability.

major comments (2)

[Abstract] Abstract: the claim of best FID / FID_CLIP / CLIP-I on pose-different subsets is presented without any description of subset construction, data splits, or controls for post-hoc selection; this information is load-bearing for interpreting whether the reported gains demonstrate the claimed pose robustness.
[Method / Experiments] The central mechanistic claim—that the region-specific loss produces spatially disentangled cross-attention maps whose derived masks correctly isolate hair even under large pose discrepancies—is not accompanied by attention-map visualizations, overlap metrics with ground-truth hair regions, or an ablation that removes the region-specific loss; without these, it is impossible to confirm that the loss (rather than other diffusion components) is responsible for the observed improvements.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive feedback. The comments highlight important areas for clarification and additional evidence, which we will address through targeted revisions to strengthen the manuscript.

read point-by-point responses

Referee: [Abstract] Abstract: the claim of best FID / FID_CLIP / CLIP-I on pose-different subsets is presented without any description of subset construction, data splits, or controls for post-hoc selection; this information is load-bearing for interpreting whether the reported gains demonstrate the claimed pose robustness.

Authors: We agree that explicit details on subset construction are essential for interpreting the quantitative claims. In the revised manuscript, we will add a dedicated paragraph in the Experiments section describing the construction of the pose-agnostic and pose-different subsets, including the pose-difference thresholds used, the source of the data splits, and steps taken to avoid post-hoc selection. A brief reference to this description will also be added to the abstract. revision: yes
Referee: [Method / Experiments] The central mechanistic claim—that the region-specific loss produces spatially disentangled cross-attention maps whose derived masks correctly isolate hair even under large pose discrepancies—is not accompanied by attention-map visualizations, overlap metrics with ground-truth hair regions, or an ablation that removes the region-specific loss; without these, it is impossible to confirm that the loss (rather than other diffusion components) is responsible for the observed improvements.

Authors: We acknowledge that direct evidence linking the region-specific loss to the disentangled attention maps would strengthen the mechanistic argument. In the revision, we will add cross-attention map visualizations comparing models with and without the region-specific loss. We will also report quantitative overlap metrics (such as IoU) between the derived source-aligned hair masks and ground-truth hair regions on a validation set. Finally, we will include an ablation study showing results for a model variant trained without the region-specific loss. These elements will be placed in the Method and Experiments sections. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation builds on standard diffusion components without self-referential reductions

full rationale

The paper's central mechanism is a region-specific loss applied during training of a diffusion model to encourage spatially disentangled cross-attention maps, from which a hair mask is then derived for guiding inpainting. No equations, fitted parameters, or self-citations are presented in the provided text that would make any claimed prediction or result equivalent to its inputs by construction. The approach extends existing attention-based conditioning and inpainting techniques with a new loss term whose effect on attention is an empirical outcome rather than a definitional identity. Quantitative results on pose-different subsets are reported as experimental outcomes, not as tautological consequences of the method definition itself.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no explicit free parameters, axioms, or invented entities; the central mechanism relies on an unstated assumption that the proposed loss produces usable disentangled attention.

pith-pipeline@v0.9.1-grok · 5715 in / 1105 out tokens · 25348 ms · 2026-06-25T21:14:02.805456+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

74 extracted references · 17 canonical work pages · 10 internal anchors

[1]

Jeong et al

Black Forest Labs: Flux.2 [klein]: Towards interactive visual intelligence,https: 16 S. Jeong et al. //bfl.ai/blog/flux2-klein-towards-interactive-visual-intelligence
[2]

In: Pro- ceedings of the IEEE/CVF international conference on computer vision

Cao, M., Wang, X., Qi, Z., Shan, Y., Qie, X., Zheng, Y.: Masactrl: Tuning-free mutual self-attention control for consistent image synthesis and editing. In: Pro- ceedings of the IEEE/CVF international conference on computer vision. pp. 22560– 22570 (2023)

2023
[3]

In: Forty-first International Conference on Machine Learning (2024)

Chen, D., Chen, R., Zhang, S., Wang, Y., Liu, Y., Zhou, H., Zhang, Q., Wan, Y., Zhou, P., Sun, L.: Mllm-as-a-judge: Assessing multimodal llm-as-a-judge with vision-language benchmark. In: Forty-first International Conference on Machine Learning (2024)

2024
[4]

arXiv preprint arXiv:2206.08585 (2022)

Chung, C., Kim, T., Nam, H., Choi, S., Gu, G., Park, S., Choo, J.: Hairfit: pose- invariant hairstyle transfer via flow-based hair alignment and semantic-region- aware inpainting. arXiv preprint arXiv:2206.08585 (2022)

work page arXiv 2022
[5]

In: Proceedings of the AAAI Conference on Artificial Intelligence

Chung, C., Park, S., Kim, J., Choo, J.: What to preserve and what to transfer: Faithful, identity-preserving diffusion-based hairstyle transfer. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 39, pp. 2582–2590 (2025)

2025
[6]

Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

Comanici, G., et al.: Gemini 2.5: Pushing the frontier with advanced reason- ing, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025
[7]

arXiv preprint arXiv:2210.11427 (2022)

Couairon, G., Verbeek, J., Schwenk, H., Cord, M.: Diffedit: Diffusion-based seman- tic image editing with mask guidance. arXiv preprint arXiv:2210.11427 (2022)

work page arXiv 2022
[8]

google/models/gemini-image/pro/

Google DeepMind: Gemini 3 pro image – nano banana pro,https://deepmind. google/models/gemini-image/pro/
[9]

In: Proceedings of the IEEE conference on computer vision and pattern recognition

Güler, R.A., Neverova, N., Kokkinos, I.: Densepose: Dense human pose estimation in the wild. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 7297–7306 (2018)

2018
[10]

h94: Ip-adapter,https://huggingface.co/h94/IP-Adapter
[11]

h94: Ip-adapter-faceid,https://huggingface.co/h94/IP-Adapter-FaceID
[12]

Prompt-to-Prompt Image Editing with Cross Attention Control

Hertz, A., Mokady, R., Tenenbaum, J., Aberman, K., Pritch, Y., Cohen-Or, D.: Prompt-to-prompt image editing with cross attention control. arXiv preprint arXiv:2208.01626 (2022)

work page internal anchor Pith review Pith/arXiv arXiv 2022
[13]

Progressive Growing of GANs for Improved Quality, Stability, and Variation

Karras, T., Aila, T., Laine, S., Lehtinen, J.: Progressive growing of gans for im- proved quality, stability, and variation. arXiv preprint arXiv:1710.10196 (2017)

work page internal anchor Pith review Pith/arXiv arXiv 2017
[14]

In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

Karras, T., Laine, S., Aila, T.: A style-based generator architecture for generative adversarial networks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 4401–4410 (2019)

2019
[15]

In: European conference on computer vision

Kim, T., Chung, C., Kim, Y., Park, S., Kim, K., Choo, J.: Style your hair: Latent optimization for pose-invariant hairstyle transfer via local-style-aware hair align- ment. In: European conference on computer vision. pp. 188–203. Springer (2022)

2022
[16]

In: Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Ku, M., Jiang, D., Wei, C., Yue, X., Chen, W.: Viescore: Towards explainable metrics for conditional image synthesis evaluation. In: Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). pp. 12268–12290 (2024)

2024
[17]

arXiv preprint arXiv:2304.13509 (2023)

Kvanchiani, K., Petrova, E., Efremyan, K., Sautin, A., Kapitanov, A.: Easyportrait–face parsing and portrait segmentation dataset. arXiv preprint arXiv:2304.13509 (2023)

work page arXiv 2023
[18]

arXiv preprint arXiv:2203.06026 (2022)

Kynkäänniemi, T., Karras, T., Aittala, M., Aila, T., Lehtinen, J.: The role of imagenet classes in fr\’echet inception distance. arXiv preprint arXiv:2203.06026 (2022)

work page arXiv 2022
[19]

LAION: Clip-vit-h-14-laion2b-s32b-b79k,https://huggingface.co/laion/CLIP- ViT-H-14-laion2B-s32B-b79K H-Adapter for Pose-Robust Hairstyle Transfer 17
[20]

In: Findings of the Association for Computational Linguistics: ACL 2024

Lee, S., Kim, S., Park, S., Kim, G., Seo, M.: Prometheus-vision: Vision-language model as a judge for fine-grained evaluation. In: Findings of the Association for Computational Linguistics: ACL 2024. pp. 11286–11315 (2024)

2024
[21]

GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models

Nichol, A., Dhariwal, P., Ramesh, A., Shyam, P., Mishkin, P., McGrew, B., Sutskever, I., Chen, M.: Glide: Towards photorealistic image generation and editing with text-guided diffusion models. arXiv preprint arXiv:2112.10741 (2021)

work page internal anchor Pith review Pith/arXiv arXiv 2021
[22]

Advances in Neural In- formation Processing Systems37, 45600–45635 (2024)

Nikolaev, M., Kuznetsov, M., Vetrov, D., Alanov, A.: Hairfastgan: Realistic and robust hair transfer with a fast encoder-based approach. Advances in Neural In- formation Processing Systems37, 45600–45635 (2024)

2024
[23]

GPT-4o System Card

OpenAI: Gpt-4o system card. arXiv preprint arXiv:2410.21276 (2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024
[24]

OpenAI GPT-5 System Card

OpenAI: Openai gpt-5 system card. arXiv preprint arXiv:2601.03267 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025
[25]

OpenAI: Update to gpt-5 system card: Gpt-5.2.https://openai.com/index/gpt- 5-system-card-update-gpt-5-2/(2025)

2025
[26]

SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis

Podell, D., English, Z., Lacey, K., Blattmann, A., Dockhorn, T., Müller, J., Penna, J., Rombach, R.: Sdxl: Improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:2307.01952 (2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023
[27]

In: International conference on machine learning

Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International conference on machine learning. pp. 8748–8763. PmLR (2021)

2021
[28]

Hierarchical Text-Conditional Image Generation with CLIP Latents

Ramesh, A., Dhariwal, P., Nichol, A., Chu, C., Chen, M.: Hierarchical text- conditional image generation with clip latents. arXiv preprint arXiv:2204.06125 1(2), 3 (2022)

work page internal anchor Pith review Pith/arXiv arXiv 2022
[29]

In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 10684–10695 (2022)

2022
[30]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

Saha, R., Duke, B., Shkurti, F., Taylor, G.W., Aarabi, P.: Loho: Latent opti- mization of hairstyles via orthogonalization. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 1984–1993 (2021)

1984
[31]

Advances in neural information processing systems35, 36479–36494 (2022)

Saharia, C., Chan, W., Saxena, S., Li, L., Whang, J., Denton, E.L., Ghasemipour, K., Gontijo Lopes, R., Karagol Ayan, B., Salimans, T., et al.: Photorealistic text- to-image diffusion models with deep language understanding. Advances in neural information processing systems35, 36479–36494 (2022)

2022
[32]

co / SG161222 / Realistic_Vision_V4.0_noVAE

SG161222: Realistic vision v4.0 novae,https : / / huggingface . co / SG161222 / Realistic_Vision_V4.0_noVAE
[33]

In: Proceedings of the IEEE/CVF International Conference on Computer Vision

Slyman, E., Tanjim, M., Kafle, K., Lee, S.: Calibrating mllm-as-a-judge via multi- modal bayesian prompt ensembles. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 17224–17234 (2025)

2025
[34]

Stability AI: sd-vae-ft-mse,https://huggingface.co/stabilityai/sd-vae-ft- mse
[35]

stabilityai: stable-diffusion-v1-5/stable-diffusion-inpainting,https : / / huggingface.co/stable-diffusion-v1-5/stable-diffusion-inpainting
[36]

In: Proceedings of the IEEE conference on computer vision and pattern recognition

Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., Wojna, Z.: Rethinking the incep- tion architecture for computer vision. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 2818–2826 (2016)

2016
[37]

Tan, Z., Chai, M., Chen, D., Liao, J., Chu, Q., Yuan, L., Tulyakov, S., Yu, N.: Michigan:multi-input-conditionedhairimagegenerationforportraitediting.arXiv preprint arXiv:2010.16417 (2020)

work page arXiv 2010
[38]

In: Proceedings of the IEEE/CVF con- ference on computer vision and pattern recognition

Tumanyan, N., Geyer, M., Bagon, S., Dekel, T.: Plug-and-play diffusion features for text-driven image-to-image translation. In: Proceedings of the IEEE/CVF con- ference on computer vision and pattern recognition. pp. 1921–1930 (2023) 18 S. Jeong et al

1921
[39]

IEEE transactions on image processing 13(4), 600–612 (2004)

Wang, Z., Bovik, A.C., Sheikh, H.R., Simoncelli, E.P.: Image quality assessment: from error visibility to structural similarity. IEEE transactions on image processing 13(4), 600–612 (2004)

2004
[40]

In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

Wei, T., Chen, D., Zhou, W., Liao, J., Tan, Z., Yuan, L., Zhang, W., Yu, N.: Hairclip: Design your hair by text and reference image. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 18072– 18081 (2022)

2022
[41]

IEEE Transactions on Pattern Analysis and Machine Intelligence (2026)

Wei, T., Chen, D., Zhou, W., Liao, J., Wang, C., Zhang, W., Hua, G., Yu, N.: Unifying multi-modal hair editing via proxy feature blending. IEEE Transactions on Pattern Analysis and Machine Intelligence (2026)

2026
[42]

In: Proceedings of the IEEE/CVF International Conference on Computer Vision

Wei, T., Chen, D., Zhou, W., Liao, J., Zhang, W., Hua, G., Yu, N.: Hairclipv2: Unifying hair editing via proxy feature blending. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 23589–23599 (2023)

2023
[43]

IP-Adapter: Text Compatible Image Prompt Adapter for Text-to-Image Diffusion Models

Ye, H., Zhang, J., Liu, S., Han, X., Yang, W.: Ip-adapter: Text compati- ble image prompt adapter for text-to-image diffusion models. arXiv preprint arXiv:2308.06721 (2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023
[44]

In: Proceedings of the European conference on computer vision (ECCV)

Yu, C., Wang, J., Peng, C., Gao, C., Yu, G., Sang, N.: Bisenet: Bilateral segmenta- tion network for real-time semantic segmentation. In: Proceedings of the European conference on computer vision (ECCV). pp. 325–341 (2018)

2018
[45]

Advances in Neural Information Processing Systems37, 5048–5073 (2024)

Zeng, Y., Zhang, Y., Jiachen, L., Shen, L., Deng, K., He, W., Wang, J.: Hairdif- fusion: Vivid multi-colored hair editing via latent diffusion. Advances in Neural Information Processing Systems37, 5048–5073 (2024)

2024
[46]

ArXivabs/2311.01361(2023)

Zhang, X., Lu, Y., Wang, W., Yan, A., Yan, J., Qin, L., Wang, H., Yan, X., Wang, W.Y., Petzold, L.R.: Gpt-4v (ision) as a generalist evaluator for vision-language tasks. arXiv preprint arXiv:2311.01361 (2023)

work page arXiv 2023
[47]

In: Proceedings of the AAAI Conference on Artificial Intelligence

Zhang, Y., Zhang, Q., Song, Y., Zhang, J., Tang, H., Liu, J.: Stable-hair: Real- world hair transfer via diffusion model. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 39, pp. 10348–10356 (2025)

2025
[48]

Qwen-Image-2.0 Technical Report

Zhao, B., Wu, C., Li, D., Meng, H., Li, J., Zhang, J., Zhou, J., Lin, J., Gao, K., Cao, K., et al.: Qwen-image-2.0 technical report. arXiv preprint arXiv:2605.10730 (2026)

work page internal anchor Pith review Pith/arXiv arXiv 2026
[49]

In: European conference on computer vision

Zhu, H., Wu, W., Zhu, W., Jiang, L., Tang, S., Zhang, L., Liu, Z., Loy, C.C.: Celebv-hq: A large-scale video facial attributes dataset. In: European conference on computer vision. pp. 650–667. Springer (2022)

2022
[50]

arXiv preprint arXiv:2106.01505 (2021)

Zhu, P., Abdal, R., Femiani, J., Wonka, P.: Barbershop: Gan-based image com- positing using segmentation masks. arXiv preprint arXiv:2106.01505 (2021)

work page arXiv 2021
[51]

In: European Conference on Computer Vision

Zhu, P., Abdal, R., Femiani, J., Wonka, P.: Hairnet: Hairstyle transfer with pose changes. In: European Conference on Computer Vision. pp. 651–667. Springer (2022)

2022
[52]

watercolor painting of a man, soft wash, paper texture, gentle gradients, high quality, high detail

Zou, S., Tang, J., Zhou, Y., He, J., Zhao, C., Zhang, R., Hu, Z., Sun, X.: Towards efficientdiffusion-basedimageeditingwithinstantattentionmasks.In:Proceedings of the AAAI Conference on Artificial Intelligence. vol. 38, pp. 7864–7872 (2024) H-Adapter for Pose-Robust Hairstyle Transfer 19 Supplementary Material A Step-wise Analysis of Region-specific Loss ...

2024
[53]

color and tone/highlights
[54]

texture/curl pattern
[55]

axis": "HFS

silhouette/volume and style identity (including bangs/part) Ignore ALL non-hair factors: - face identity - skin tone/lighting - background/clothes - general aesthetics Score mapping: - 5: Hair attributes are almost fully aligned with REFERENCE. - 4: Mostly aligned with minor inconsistency. - 3: Mixed; clear match in some attributes but clear mismatch in o...
[56]

Determine which hair error tags apply
[57]

axis": "HFS

Judge severity and assign final integer score 1..5. Allowed error tags: - silhouette_mismatch - bangs_part_mismatch - texture_mismatch - color_mismatch - hairline_mismatch - partial_transfer - style_identity_mismatch Scoring guidance from error severity: - 5: No meaningful hair errors. - 4: One minor error. - 3: One or two clear errors (moderate impact). ...
[58]

facial identity geometry (eyes, nose, mouth, contour)
[59]

skin tone and lighting/shadow
[60]

background consistency
[61]

axis": "NPS

clothing/accessories consistency Ignore hairstyle quality and hairstyle transfer quality completely. Score mapping: - 5: Preserved almost perfectly. - 4: Minor change but same identity/scene. - 3: Moderate change; still partially preserved. - 2: Major changes in multiple non-hair aspects. - 1: Non-hair preservation largely failed. Use applicable tags: - i...
[62]

Check boundary regions first: hairline, hair-skin, hair-background
[63]

Check face/skin texture and structural consistency
[64]

Evaluate ONLY these artifact categories:

Check global quality issues: blur, noise/compression, repetition. Evaluate ONLY these artifact categories:
[65]

boundary_blending_artifact
[66]

hairline_contour_artifact
[67]

unnatural_skin_texture
[68]

visible_patch_or_seam
[69]

structural_distortion
[70]

blur_or_focus_artifact
[71]

noise_or_compression_artifact
[72]

axis": "AQS

repetition_or_tiling_artifact Ignore all non-artifact factors: - hair-reference similarity - source identity preservation - attractiveness or style preference Score mapping: - 5: No meaningful artifact. - 4: One minor artifact. - 3: One clear artifact or several minor artifacts. - 2: Multiple major artifacts. - 1: Severe artifact failure. Tag from applica...
[73]

Select artifact tags that apply
[74]

axis": "AQS

Judge severity and assign final integer score 1..5. Allowed tags: - boundary_blending_artifact - hairline_contour_artifact - unnatural_skin_texture - visible_patch_or_seam - structural_distortion - blur_or_focus_artifact - noise_or_compression_artifact - repetition_or_tiling_artifact Scoring guidance from artifact severity: - 5: No meaningful artifacts. -...

[1] [1]

Jeong et al

Black Forest Labs: Flux.2 [klein]: Towards interactive visual intelligence,https: 16 S. Jeong et al. //bfl.ai/blog/flux2-klein-towards-interactive-visual-intelligence

[2] [2]

In: Pro- ceedings of the IEEE/CVF international conference on computer vision

Cao, M., Wang, X., Qi, Z., Shan, Y., Qie, X., Zheng, Y.: Masactrl: Tuning-free mutual self-attention control for consistent image synthesis and editing. In: Pro- ceedings of the IEEE/CVF international conference on computer vision. pp. 22560– 22570 (2023)

2023

[3] [3]

In: Forty-first International Conference on Machine Learning (2024)

Chen, D., Chen, R., Zhang, S., Wang, Y., Liu, Y., Zhou, H., Zhang, Q., Wan, Y., Zhou, P., Sun, L.: Mllm-as-a-judge: Assessing multimodal llm-as-a-judge with vision-language benchmark. In: Forty-first International Conference on Machine Learning (2024)

2024

[4] [4]

arXiv preprint arXiv:2206.08585 (2022)

Chung, C., Kim, T., Nam, H., Choi, S., Gu, G., Park, S., Choo, J.: Hairfit: pose- invariant hairstyle transfer via flow-based hair alignment and semantic-region- aware inpainting. arXiv preprint arXiv:2206.08585 (2022)

work page arXiv 2022

[5] [5]

In: Proceedings of the AAAI Conference on Artificial Intelligence

Chung, C., Park, S., Kim, J., Choo, J.: What to preserve and what to transfer: Faithful, identity-preserving diffusion-based hairstyle transfer. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 39, pp. 2582–2590 (2025)

2025

[6] [6]

Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

Comanici, G., et al.: Gemini 2.5: Pushing the frontier with advanced reason- ing, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025

[7] [7]

arXiv preprint arXiv:2210.11427 (2022)

Couairon, G., Verbeek, J., Schwenk, H., Cord, M.: Diffedit: Diffusion-based seman- tic image editing with mask guidance. arXiv preprint arXiv:2210.11427 (2022)

work page arXiv 2022

[8] [8]

google/models/gemini-image/pro/

Google DeepMind: Gemini 3 pro image – nano banana pro,https://deepmind. google/models/gemini-image/pro/

[9] [9]

In: Proceedings of the IEEE conference on computer vision and pattern recognition

Güler, R.A., Neverova, N., Kokkinos, I.: Densepose: Dense human pose estimation in the wild. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 7297–7306 (2018)

2018

[10] [10]

h94: Ip-adapter,https://huggingface.co/h94/IP-Adapter

[11] [11]

h94: Ip-adapter-faceid,https://huggingface.co/h94/IP-Adapter-FaceID

[12] [12]

Prompt-to-Prompt Image Editing with Cross Attention Control

Hertz, A., Mokady, R., Tenenbaum, J., Aberman, K., Pritch, Y., Cohen-Or, D.: Prompt-to-prompt image editing with cross attention control. arXiv preprint arXiv:2208.01626 (2022)

work page internal anchor Pith review Pith/arXiv arXiv 2022

[13] [13]

Progressive Growing of GANs for Improved Quality, Stability, and Variation

Karras, T., Aila, T., Laine, S., Lehtinen, J.: Progressive growing of gans for im- proved quality, stability, and variation. arXiv preprint arXiv:1710.10196 (2017)

work page internal anchor Pith review Pith/arXiv arXiv 2017

[14] [14]

In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

Karras, T., Laine, S., Aila, T.: A style-based generator architecture for generative adversarial networks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 4401–4410 (2019)

2019

[15] [15]

In: European conference on computer vision

Kim, T., Chung, C., Kim, Y., Park, S., Kim, K., Choo, J.: Style your hair: Latent optimization for pose-invariant hairstyle transfer via local-style-aware hair align- ment. In: European conference on computer vision. pp. 188–203. Springer (2022)

2022

[16] [16]

In: Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Ku, M., Jiang, D., Wei, C., Yue, X., Chen, W.: Viescore: Towards explainable metrics for conditional image synthesis evaluation. In: Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). pp. 12268–12290 (2024)

2024

[17] [17]

arXiv preprint arXiv:2304.13509 (2023)

Kvanchiani, K., Petrova, E., Efremyan, K., Sautin, A., Kapitanov, A.: Easyportrait–face parsing and portrait segmentation dataset. arXiv preprint arXiv:2304.13509 (2023)

work page arXiv 2023

[18] [18]

arXiv preprint arXiv:2203.06026 (2022)

Kynkäänniemi, T., Karras, T., Aittala, M., Aila, T., Lehtinen, J.: The role of imagenet classes in fr\’echet inception distance. arXiv preprint arXiv:2203.06026 (2022)

work page arXiv 2022

[19] [19]

LAION: Clip-vit-h-14-laion2b-s32b-b79k,https://huggingface.co/laion/CLIP- ViT-H-14-laion2B-s32B-b79K H-Adapter for Pose-Robust Hairstyle Transfer 17

[20] [20]

In: Findings of the Association for Computational Linguistics: ACL 2024

Lee, S., Kim, S., Park, S., Kim, G., Seo, M.: Prometheus-vision: Vision-language model as a judge for fine-grained evaluation. In: Findings of the Association for Computational Linguistics: ACL 2024. pp. 11286–11315 (2024)

2024

[21] [21]

GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models

Nichol, A., Dhariwal, P., Ramesh, A., Shyam, P., Mishkin, P., McGrew, B., Sutskever, I., Chen, M.: Glide: Towards photorealistic image generation and editing with text-guided diffusion models. arXiv preprint arXiv:2112.10741 (2021)

work page internal anchor Pith review Pith/arXiv arXiv 2021

[22] [22]

Advances in Neural In- formation Processing Systems37, 45600–45635 (2024)

Nikolaev, M., Kuznetsov, M., Vetrov, D., Alanov, A.: Hairfastgan: Realistic and robust hair transfer with a fast encoder-based approach. Advances in Neural In- formation Processing Systems37, 45600–45635 (2024)

2024

[23] [23]

GPT-4o System Card

OpenAI: Gpt-4o system card. arXiv preprint arXiv:2410.21276 (2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024

[24] [24]

OpenAI GPT-5 System Card

OpenAI: Openai gpt-5 system card. arXiv preprint arXiv:2601.03267 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025

[25] [25]

OpenAI: Update to gpt-5 system card: Gpt-5.2.https://openai.com/index/gpt- 5-system-card-update-gpt-5-2/(2025)

2025

[26] [26]

SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis

Podell, D., English, Z., Lacey, K., Blattmann, A., Dockhorn, T., Müller, J., Penna, J., Rombach, R.: Sdxl: Improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:2307.01952 (2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023

[27] [27]

In: International conference on machine learning

Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International conference on machine learning. pp. 8748–8763. PmLR (2021)

2021

[28] [28]

Hierarchical Text-Conditional Image Generation with CLIP Latents

Ramesh, A., Dhariwal, P., Nichol, A., Chu, C., Chen, M.: Hierarchical text- conditional image generation with clip latents. arXiv preprint arXiv:2204.06125 1(2), 3 (2022)

work page internal anchor Pith review Pith/arXiv arXiv 2022

[29] [29]

In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 10684–10695 (2022)

2022

[30] [30]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

Saha, R., Duke, B., Shkurti, F., Taylor, G.W., Aarabi, P.: Loho: Latent opti- mization of hairstyles via orthogonalization. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 1984–1993 (2021)

1984

[31] [31]

Advances in neural information processing systems35, 36479–36494 (2022)

Saharia, C., Chan, W., Saxena, S., Li, L., Whang, J., Denton, E.L., Ghasemipour, K., Gontijo Lopes, R., Karagol Ayan, B., Salimans, T., et al.: Photorealistic text- to-image diffusion models with deep language understanding. Advances in neural information processing systems35, 36479–36494 (2022)

2022

[32] [32]

co / SG161222 / Realistic_Vision_V4.0_noVAE

SG161222: Realistic vision v4.0 novae,https : / / huggingface . co / SG161222 / Realistic_Vision_V4.0_noVAE

[33] [33]

In: Proceedings of the IEEE/CVF International Conference on Computer Vision

Slyman, E., Tanjim, M., Kafle, K., Lee, S.: Calibrating mllm-as-a-judge via multi- modal bayesian prompt ensembles. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 17224–17234 (2025)

2025

[34] [34]

Stability AI: sd-vae-ft-mse,https://huggingface.co/stabilityai/sd-vae-ft- mse

[35] [35]

stabilityai: stable-diffusion-v1-5/stable-diffusion-inpainting,https : / / huggingface.co/stable-diffusion-v1-5/stable-diffusion-inpainting

[36] [36]

In: Proceedings of the IEEE conference on computer vision and pattern recognition

Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., Wojna, Z.: Rethinking the incep- tion architecture for computer vision. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 2818–2826 (2016)

2016

[37] [37]

Tan, Z., Chai, M., Chen, D., Liao, J., Chu, Q., Yuan, L., Tulyakov, S., Yu, N.: Michigan:multi-input-conditionedhairimagegenerationforportraitediting.arXiv preprint arXiv:2010.16417 (2020)

work page arXiv 2010

[38] [38]

In: Proceedings of the IEEE/CVF con- ference on computer vision and pattern recognition

Tumanyan, N., Geyer, M., Bagon, S., Dekel, T.: Plug-and-play diffusion features for text-driven image-to-image translation. In: Proceedings of the IEEE/CVF con- ference on computer vision and pattern recognition. pp. 1921–1930 (2023) 18 S. Jeong et al

1921

[39] [39]

IEEE transactions on image processing 13(4), 600–612 (2004)

Wang, Z., Bovik, A.C., Sheikh, H.R., Simoncelli, E.P.: Image quality assessment: from error visibility to structural similarity. IEEE transactions on image processing 13(4), 600–612 (2004)

2004

[40] [40]

In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

Wei, T., Chen, D., Zhou, W., Liao, J., Tan, Z., Yuan, L., Zhang, W., Yu, N.: Hairclip: Design your hair by text and reference image. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 18072– 18081 (2022)

2022

[41] [41]

IEEE Transactions on Pattern Analysis and Machine Intelligence (2026)

Wei, T., Chen, D., Zhou, W., Liao, J., Wang, C., Zhang, W., Hua, G., Yu, N.: Unifying multi-modal hair editing via proxy feature blending. IEEE Transactions on Pattern Analysis and Machine Intelligence (2026)

2026

[42] [42]

In: Proceedings of the IEEE/CVF International Conference on Computer Vision

Wei, T., Chen, D., Zhou, W., Liao, J., Zhang, W., Hua, G., Yu, N.: Hairclipv2: Unifying hair editing via proxy feature blending. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 23589–23599 (2023)

2023

[43] [43]

IP-Adapter: Text Compatible Image Prompt Adapter for Text-to-Image Diffusion Models

Ye, H., Zhang, J., Liu, S., Han, X., Yang, W.: Ip-adapter: Text compati- ble image prompt adapter for text-to-image diffusion models. arXiv preprint arXiv:2308.06721 (2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023

[44] [44]

In: Proceedings of the European conference on computer vision (ECCV)

Yu, C., Wang, J., Peng, C., Gao, C., Yu, G., Sang, N.: Bisenet: Bilateral segmenta- tion network for real-time semantic segmentation. In: Proceedings of the European conference on computer vision (ECCV). pp. 325–341 (2018)

2018

[45] [45]

Advances in Neural Information Processing Systems37, 5048–5073 (2024)

Zeng, Y., Zhang, Y., Jiachen, L., Shen, L., Deng, K., He, W., Wang, J.: Hairdif- fusion: Vivid multi-colored hair editing via latent diffusion. Advances in Neural Information Processing Systems37, 5048–5073 (2024)

2024

[46] [46]

ArXivabs/2311.01361(2023)

Zhang, X., Lu, Y., Wang, W., Yan, A., Yan, J., Qin, L., Wang, H., Yan, X., Wang, W.Y., Petzold, L.R.: Gpt-4v (ision) as a generalist evaluator for vision-language tasks. arXiv preprint arXiv:2311.01361 (2023)

work page arXiv 2023

[47] [47]

In: Proceedings of the AAAI Conference on Artificial Intelligence

Zhang, Y., Zhang, Q., Song, Y., Zhang, J., Tang, H., Liu, J.: Stable-hair: Real- world hair transfer via diffusion model. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 39, pp. 10348–10356 (2025)

2025

[48] [48]

Qwen-Image-2.0 Technical Report

Zhao, B., Wu, C., Li, D., Meng, H., Li, J., Zhang, J., Zhou, J., Lin, J., Gao, K., Cao, K., et al.: Qwen-image-2.0 technical report. arXiv preprint arXiv:2605.10730 (2026)

work page internal anchor Pith review Pith/arXiv arXiv 2026

[49] [49]

In: European conference on computer vision

Zhu, H., Wu, W., Zhu, W., Jiang, L., Tang, S., Zhang, L., Liu, Z., Loy, C.C.: Celebv-hq: A large-scale video facial attributes dataset. In: European conference on computer vision. pp. 650–667. Springer (2022)

2022

[50] [50]

arXiv preprint arXiv:2106.01505 (2021)

Zhu, P., Abdal, R., Femiani, J., Wonka, P.: Barbershop: Gan-based image com- positing using segmentation masks. arXiv preprint arXiv:2106.01505 (2021)

work page arXiv 2021

[51] [51]

In: European Conference on Computer Vision

Zhu, P., Abdal, R., Femiani, J., Wonka, P.: Hairnet: Hairstyle transfer with pose changes. In: European Conference on Computer Vision. pp. 651–667. Springer (2022)

2022

[52] [52]

watercolor painting of a man, soft wash, paper texture, gentle gradients, high quality, high detail

Zou, S., Tang, J., Zhou, Y., He, J., Zhao, C., Zhang, R., Hu, Z., Sun, X.: Towards efficientdiffusion-basedimageeditingwithinstantattentionmasks.In:Proceedings of the AAAI Conference on Artificial Intelligence. vol. 38, pp. 7864–7872 (2024) H-Adapter for Pose-Robust Hairstyle Transfer 19 Supplementary Material A Step-wise Analysis of Region-specific Loss ...

2024

[53] [53]

color and tone/highlights

[54] [54]

texture/curl pattern

[55] [55]

axis": "HFS

silhouette/volume and style identity (including bangs/part) Ignore ALL non-hair factors: - face identity - skin tone/lighting - background/clothes - general aesthetics Score mapping: - 5: Hair attributes are almost fully aligned with REFERENCE. - 4: Mostly aligned with minor inconsistency. - 3: Mixed; clear match in some attributes but clear mismatch in o...

[56] [56]

Determine which hair error tags apply

[57] [57]

axis": "HFS

Judge severity and assign final integer score 1..5. Allowed error tags: - silhouette_mismatch - bangs_part_mismatch - texture_mismatch - color_mismatch - hairline_mismatch - partial_transfer - style_identity_mismatch Scoring guidance from error severity: - 5: No meaningful hair errors. - 4: One minor error. - 3: One or two clear errors (moderate impact). ...

[58] [58]

facial identity geometry (eyes, nose, mouth, contour)

[59] [59]

skin tone and lighting/shadow

[60] [60]

background consistency

[61] [61]

axis": "NPS

clothing/accessories consistency Ignore hairstyle quality and hairstyle transfer quality completely. Score mapping: - 5: Preserved almost perfectly. - 4: Minor change but same identity/scene. - 3: Moderate change; still partially preserved. - 2: Major changes in multiple non-hair aspects. - 1: Non-hair preservation largely failed. Use applicable tags: - i...

[62] [62]

Check boundary regions first: hairline, hair-skin, hair-background

[63] [63]

Check face/skin texture and structural consistency

[64] [64]

Evaluate ONLY these artifact categories:

Check global quality issues: blur, noise/compression, repetition. Evaluate ONLY these artifact categories:

[65] [65]

boundary_blending_artifact

[66] [66]

hairline_contour_artifact

[67] [67]

unnatural_skin_texture

[68] [68]

visible_patch_or_seam

[69] [69]

structural_distortion

[70] [70]

blur_or_focus_artifact

[71] [71]

noise_or_compression_artifact

[72] [72]

axis": "AQS

repetition_or_tiling_artifact Ignore all non-artifact factors: - hair-reference similarity - source identity preservation - attractiveness or style preference Score mapping: - 5: No meaningful artifact. - 4: One minor artifact. - 3: One clear artifact or several minor artifacts. - 2: Multiple major artifacts. - 1: Severe artifact failure. Tag from applica...

[73] [73]

Select artifact tags that apply

[74] [74]

axis": "AQS

Judge severity and assign final integer score 1..5. Allowed tags: - boundary_blending_artifact - hairline_contour_artifact - unnatural_skin_texture - visible_patch_or_seam - structural_distortion - blur_or_focus_artifact - noise_or_compression_artifact - repetition_or_tiling_artifact Scoring guidance from artifact severity: - 5: No meaningful artifacts. -...