Durian: Dual Reference Image-Guided Portrait Animation with Attribute Transfer

Byungjun Kim; Hanbyul Joo; Hyunsoo Cha

arxiv: 2509.04434 · v3 · submitted 2025-09-04 · 💻 cs.CV

Pith reviewed 2026-05-18 18:32 UTC · model grok-4.3

keywords: portrait animation · attribute transfer · cross-identity · diffusion model · dual reference · self-reconstruction · video generation · reference-guided synthesis

The pith

Durian learns cross-identity attribute transfer for portrait animation by self-reconstructing ordinary videos with a dual-reference diffusion model.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to generate animated portrait videos that apply attributes such as hairstyles or eyeglasses, taken from reference images of different people, onto a target portrait. It addresses the shortage of paired training data by turning ordinary portrait videos into pseudo-pairs: one frame supplies identity information and another supplies the attributes to be transferred. Complementary masking of the two references keeps each stream specialized, and a Dual ReferenceNet processes them separately before fusing their features with spatial attention inside a diffusion model. The resulting training signal teaches the model to recombine identity and attributes across identities. When successful, this produces state-of-the-art animation quality together with the ability to combine multiple attributes or interpolate between them in a single pass.

Core claim

Durian shows that a self-reconstruction objective on ordinary portrait videos suffices to learn attribute transfer across identities. Two frames from the same video serve as an identity reference and an attribute reference; complementary masking keeps their roles distinct while a Dual ReferenceNet processes them separately and fuses their features via spatial attention inside the diffusion model. The model is trained to reconstruct the original frame, thereby acquiring the capacity for cross-identity transfer. Mask expansion and targeted augmentations close the gap between this training regime and real cross-identity inference, yielding robust performance on attributes of varying spatial extent and misalignment.
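The pseudo-pair construction described above can be illustrated with a minimal NumPy sketch. It assumes a clip stored as a `(T, H, W, C)` array and a per-frame boolean attribute mask; the function name `make_pseudo_pair` and the array layout are hypothetical choices for illustration, not the authors' code, and the paper's actual masks come from a segmentation step that is not reproduced here.

```python
import numpy as np

def make_pseudo_pair(video, attr_masks, rng):
    """Sample two frames of one clip and mask them in complementary ways.

    video:      (T, H, W, C) float array, a single portrait clip.
    attr_masks: (T, H, W) bool array, True on the attribute region (e.g. hair).
    Hypothetical helper: names and layout are assumptions for illustration.
    """
    # Two distinct frame indices from the same video form the pseudo-pair.
    t_id, t_attr = rng.choice(len(video), size=2, replace=False)
    # Identity reference: the portrait with its attribute region blanked out.
    id_ref = video[t_id] * ~attr_masks[t_id][..., None]
    # Attribute reference: only the attribute region of the other frame is kept.
    attr_ref = video[t_attr] * attr_masks[t_attr][..., None]
    return id_ref, attr_ref, (int(t_id), int(t_attr))

rng = np.random.default_rng(0)
video = np.ones((4, 8, 8, 3))
masks = np.zeros((4, 8, 8), dtype=bool)
masks[:, :3] = True  # toy "hair" region: the top three rows of every frame
id_ref, attr_ref, (t_id, t_attr) = make_pseudo_pair(video, masks, rng)
```

Because the two masks are complements of each other, neither reference alone contains the full frame, which is the property that is meant to keep the streams specialized.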

What carries the argument

The Dual ReferenceNet that processes identity and attribute references separately before fusing their features via spatial attention inside the diffusion model, guided by complementary masking.
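As a rough illustration of fusing two reference streams through spatial attention, the sketch below lets the denoising tokens attend over themselves plus the identity and attribute tokens. This is an assumption about the general ReferenceNet-style pattern, not the paper's exact layer: learned query, key, and value projections are omitted, and the name `fuse_spatial_attention` is hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def fuse_spatial_attention(x, id_feat, attr_feat):
    """Attend from denoising tokens to denoising + identity + attribute tokens.

    x, id_feat, attr_feat: (N, D) arrays of flattened spatial tokens from one layer.
    Simplification (assumption): no learned projections, single head.
    """
    d = x.shape[-1]
    kv = np.concatenate([x, id_feat, attr_feat], axis=0)  # (3N, D) keys and values
    weights = softmax(x @ kv.T / np.sqrt(d), axis=-1)     # (N, 3N) attention weights
    return weights @ kv                                   # (N, D) fused tokens

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 4))
fused = fuse_spatial_attention(x, rng.normal(size=(5, 4)), rng.normal(size=(5, 4)))
```

Only the query tokens produce outputs, so the fused result keeps the spatial layout of the denoising features while drawing content from both references.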

If this is right

The method achieves state-of-the-art results on portrait animation tasks that include attribute transfer.
Its dual-reference design supports composing attributes from multiple references in one generation pass.
Smooth attribute interpolation becomes possible within a single generation pass.
Mask expansion and augmentations make transfer robust to variations in spatial extent and misalignment.
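The source does not say how mask expansion is implemented. One plausible reading, shown here purely as an assumption, is a morphological dilation of the attribute mask so that a reference attribute larger than the original one still fits inside the masked region. The helper `expand_mask` and the `radius` parameter are hypothetical.

```python
import numpy as np

def expand_mask(mask, radius):
    """Dilate a boolean (H, W) mask by `radius` pixels using a square window.

    Assumption: plain binary dilation stands in for the paper's mask expansion.
    """
    h, w = mask.shape
    padded = np.pad(mask, radius, mode="constant", constant_values=False)
    out = np.zeros_like(mask)
    # OR together every shifted copy of the mask within the square window.
    for dy in range(2 * radius + 1):
        for dx in range(2 * radius + 1):
            out |= padded[dy:dy + h, dx:dx + w]
    return out

mask = np.zeros((7, 7), dtype=bool)
mask[3, 3] = True
expanded = expand_mask(mask, 1)  # a single pixel grows into a 3x3 block
```

A larger `radius` trades identity context for tolerance to misalignment, which is why expansion factors appear among the tunable parameters of the method.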

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same self-reconstruction pattern could be applied to other conditional video tasks where paired examples are scarce.
Interpolation between attributes may give creators gradual control over generated expressions or styles.
Extending the dual-stream design might improve fine-grained controllability in related face-editing applications.
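The source reports interpolation controlled by a parameter α but does not describe the mechanism. A simple way to picture it, offered only as an assumption, is a linear blend of the attribute-stream features computed from two references. The function name below is hypothetical.

```python
import numpy as np

def interpolate_attribute_features(feat_a, feat_b, alpha):
    """Linearly blend two attribute feature maps; alpha=0 gives A, alpha=1 gives B.

    Assumption: feature-space linear blending, not confirmed by the source.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return (1.0 - alpha) * feat_a + alpha * feat_b

feat_a = np.zeros((2, 2))
feat_b = np.ones((2, 2))
blend = interpolate_attribute_features(feat_a, feat_b, 0.25)
```

Sweeping `alpha` from 0 to 1 would then move the conditioning gradually from one reference to the other.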

Load-bearing premise

Ordinary video frames can serve as pseudo-pairs that let complementary masking and dual feature fusion teach the model to separate identity information from attribute information.

What would settle it

Generate animations from cross-identity references that contain large un-augmented pose or lighting differences and check whether identity preservation and attribute accuracy both remain high.
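A check of this kind could be scored as sketched below, assuming identity and attribute embeddings have already been extracted by some external encoders. The embedding models, the threshold values, and both function names are hypothetical; the sketch only shows the bookkeeping.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two 1-D embedding vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def passes_stress_test(id_sims, attr_sims, id_thresh=0.5, attr_thresh=0.5):
    """True only if mean identity AND mean attribute similarity clear their thresholds.

    Thresholds are placeholder values, not figures from the paper.
    """
    return bool(np.mean(id_sims) >= id_thresh and np.mean(attr_sims) >= attr_thresh)

same = cosine_similarity([1.0, 0.0], [1.0, 0.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
ok = passes_stress_test([0.9, 0.8], [0.7, 0.6])
```

Requiring both scores to hold at once matters because a model can trivially keep identity by ignoring the attribute reference, or copy the attribute while drifting in identity.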

Figures

Figures reproduced from arXiv: 2509.04434 by Byungjun Kim, Hanbyul Joo, Hyunsoo Cha.

**Figure 1.** Portrait Animation with Attribute Transfer. Given a portrait image and single or multiple reference images specifying target attributes (e.g., hairstyle, eyeglasses), our method generates a portrait animation with facial attribute transfer conditioned on a keypoint sequence.

**Figure 2.** Overview of Training Pipeline. Given an attribute-masked portrait image $\tilde{I}_{\text{port}}$ and an attribute-only image $\tilde{I}_{\text{attr}}$, Durian synthesizes a portrait animation with the transferred attribute. These inputs are constructed by randomly sampling two frames from a training video and applying the estimated masks. A sequence of facial keypoints $\{k_\tau\}_{\tau=1}^{F}$ is extracted from the video to guide the motion. During genera…

**Figure 3.** Aligned Attribute Mask Estimation. To improve attribute-portrait alignment, we estimate an aligned attribute mask via Face Aligner. Inference pipeline: at inference time, our system takes as input a portrait image, an attribute image, and a keypoint sequence. We first construct two masked reference images, the attribute-only image $\tilde{I}_{\text{attr}}$ and the attribute-masked portrait image $\tilde{I}_{\text{port}}$, by applying segmen…

**Figure 4.** Qualitative Comparison for Cross-Attribute Transfer. We compare our method with baselines that combine X-Portrait (Xie et al., 2024) with StableHair (Zhang et al., 2025) in the cross-identity transfer setup. We provide more results in our Supp. Mat. Panel labels: Single ReferenceNet; w/o ref. mask input; full ref. image input; Portrait; Hair; Ours; w/o mask expansion; w/o ref. image aug.

**Figure 5.** Ablation Study. Omitting components or altering the training scheme degrades visual quality. Dataset: we train our model on CelebV-Text (Yu et al., 2023), VFHQ (Xie et al., 2022), and NeRSemble (Kirschstein et al., 2023), totaling 2,747 videos. For evaluation, we sample 200 videos for self-attribute transfer and 50 videos for cross-attribute transfer from CelebV-Text and VFHQ, ensuring diverse and unseen ident…

**Figure 6.** Multi-Attribute Transfer. Our model supports composition of multiple attributes (e.g., hair, eyeglasses, beard, hat) in a single forward pass without additional training. Panel labels: Portrait; Hair A; Hair B.

**Figure 7.** Attribute Interpolation. Our model enables smooth and consistent transitions between hair attributes by varying the interpolation parameter α. More examples are in our Supp. Mat. … and attribute images concatenated along the channel dimension, following CAT-VTON (Chong et al., 2024). This setup fails to separate the roles of the two inputs, resulting in undesired blending of attribute and identity cues. “w/o…

**Figure 8.** Ablation Study for Face Aligner. Omitting Face Aligner at inference time degrades the visual quality of the generated animation. Panel labels: Video Frames; GT; Ours; PbE; StableHair; TriplaneEdit; HairFusion; LivePortrait; X-Portrait; MegActor-Σ.

**Figure 9.** Qualitative Comparison of Self-Attribute Transfer in the Hair Category. We compare our method and the baselines that combine portrait animation method with image or hairstyle editing methods. Our results show the highest quality closest to the ground truth, while other methods produce artifacts or unnatural appearances. B Additional Results. B.1 Additional Ablation Study for Face Aligner. We perform an ablat…

**Figure 10.** Qualitative Comparison of Cross-Attribute Transfer in the Hair Category. We compare our method with the baselines that combine image editing and portrait animation. Our results best preserve the identity of the portrait image while most effectively transferring the hairstyle. Panel labels: Video Frames; GT; TE+LP; Ours.

**Figure 11.** Qualitative Comparison of Self-Attribute Transfer in the Eyeglasses Category. TE represents TriplaneEdit and LP denotes LivePortrait. In the self-attribute transfer setting on the eyeglasses category, we compare our results with the baseline. Our method produces portrait animations most similar to the ground truth while remaining the most natural.

**Figure 12.** Qualitative Results for Single-Attribute Transfer. We present additional results on hair, hat, eyeglasses, and beard attribute transfer for portrait animation. Our method preserves the fine details of the original portrait while achieving natural and seamless attribute transfer.

**Figure 13.** Qualitative Results for Dual-Attribute Transfer. We demonstrate the results of simultaneously transferring two attributes for portrait animation. Panel labels: Hair+Beard+Eyeglasses; Hat+Beard+Eyeglasses; Reference; Portrait Animation with Attribute Transfer.

**Figure 14.** Qualitative Results for Triple-Attribute Transfer. We present the results of simultaneously transferring three attributes. In each example, the image in the top-left corner indicates the target portrait.

**Figure 15.** Attribute Interpolation. We demonstrate smooth and consistent interpolation of additional attributes such as beard, eyeglasses, and hat according to the α values, extending beyond the hair interpolation results shown in the main paper. Panel labels: Portrait; Reference; Hair; Portrait Animation with Attribute Transfer.

**Figure 16.** Text-to-Image Generated Attribute Transfer for Portrait Animation. We generate a portrait animation with attribute transfer from a textual description by using FLUX (Labs, 2024) to synthesize a high-quality portrait image with the desired hair attribute.

Original abstract

We present Durian, the first method for generating portrait animation videos with cross-identity attribute transfer from one or more reference images to a target portrait. Training such models typically requires attribute pairs of the same individual, which are rarely available at scale. To address this challenge, we propose a self-reconstruction formulation that leverages ordinary portrait videos to learn attribute transfer without explicit paired data. Two frames from the same video act as a pseudo pair: one serves as an attribute reference and the other as an identity reference. To enable this self-reconstruction training, we introduce a Dual ReferenceNet that processes the two references separately and then fuses their features via spatial attention within a diffusion model. To make sure each reference functions as a specialized stream for either identity or attribute information, we apply complementary masking to the reference images. Together, these two components guide the model to reconstruct the original video, naturally learning cross-identity attribute transfer. To bridge the gap between self-reconstruction training and cross-identity inference, we introduce a mask expansion strategy and augmentation schemes, enabling robust transfer of attributes with varying spatial extent and misalignment. Durian achieves state-of-the-art performance on portrait animation with attribute transfer. Moreover, its dual reference design uniquely supports multi-attribute composition and smooth attribute interpolation within a single generation pass, enabling highly flexible and controllable synthesis.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Durian gives a practical self-supervised route to cross-identity attribute transfer in portrait videos by pairing frames from ordinary clips and using dual references with masking.

The main point is that Durian trains on regular portrait videos by treating two frames as a pseudo-pair: one for identity, one for attributes. A Dual ReferenceNet handles them separately, fuses via spatial attention inside a diffusion backbone, and applies complementary masking so each stream specializes. Mask expansion and augmentations then help the model handle real cross-identity cases at test time. The same setup also supports composing multiple attributes or interpolating them smoothly in one forward pass.

Referee Report

2 major / 2 minor

Summary. The paper presents Durian, a method for portrait animation videos enabling cross-identity attribute transfer from one or more reference images to a target portrait. It introduces a self-reconstruction training scheme on ordinary portrait videos (using two frames as pseudo-pairs), a Dual ReferenceNet that processes references separately and fuses features via spatial attention in a diffusion model, complementary masking to specialize identity versus attribute streams, plus mask expansion and augmentations to close the train-inference gap. The work claims state-of-the-art performance on portrait animation with attribute transfer and unique support for multi-attribute composition and smooth interpolation in a single generation pass.

Significance. If the empirical claims hold, the self-reconstruction approach could reduce reliance on scarce paired attribute data, enabling more scalable training for controllable portrait video synthesis. The dual-reference architecture's support for composition and interpolation within one pass offers a practical advantage for flexible editing applications.

major comments (2)

[Abstract] Abstract: the SOTA performance claim and the assertion that the dual reference design 'uniquely supports multi-attribute composition' are load-bearing for the central contribution, yet the provided description contains no quantitative tables, specific metrics, baseline comparisons, or ablation results to substantiate them.
[Abstract] Training formulation (self-reconstruction objective): the claim that complementary masking plus dual feature fusion successfully separates identity and attribute information for cross-identity generalization rests on the unverified assumption that the model learns transferable streams rather than copying or leaking features from same-video frames; without explicit loss equations, separation metrics, or cross-identity error analysis this remains a potential gap.

minor comments (2)

[Abstract] Abstract: the description of mask expansion and augmentation schemes could specify the exact expansion factors and augmentation parameters used, as these are listed among the free parameters.
The manuscript would benefit from a dedicated limitations paragraph addressing failure cases such as extreme pose misalignment or attribute types with large spatial extent.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment point by point below, providing clarifications based on the full paper content and indicating where revisions will be incorporated to strengthen the presentation.

Referee: [Abstract] Abstract: the SOTA performance claim and the assertion that the dual reference design 'uniquely supports multi-attribute composition' are load-bearing for the central contribution, yet the provided description contains no quantitative tables, specific metrics, baseline comparisons, or ablation results to substantiate them.

Authors: We agree that the abstract, being a high-level summary, does not contain the detailed quantitative evidence. The full manuscript substantiates the SOTA claims with tables in the Experiments section comparing against recent baselines on metrics including FID, FVD, and user preference scores, along with ablations on the dual-reference components. The unique support for multi-attribute composition and interpolation is shown via qualitative results and a dedicated subsection demonstrating single-pass generation. To make the abstract more self-contained while remaining concise, we will revise it to include brief references to key quantitative gains and the demonstrated capabilities. revision: yes
Referee: [Abstract] Training formulation (self-reconstruction objective): the claim that complementary masking plus dual feature fusion successfully separates identity and attribute information for cross-identity generalization rests on the unverified assumption that the model learns transferable streams rather than copying or leaking features from same-video frames; without explicit loss equations, separation metrics, or cross-identity error analysis this remains a potential gap.

Authors: This is a valid concern about potential intra-video leakage versus true cross-identity transfer. The manuscript presents the self-reconstruction objective, Dual ReferenceNet with spatial attention fusion, and complementary masking strategy in Section 3, along with the training loss formulation. Cross-identity generalization is evaluated on held-out identities in the results section. To further address the separation assumption, we will add a short analysis with stream-specific feature similarity metrics in the revised manuscript. revision: partial

Circularity Check

0 steps flagged

No significant circularity detected

The paper introduces a new Dual ReferenceNet architecture and self-reconstruction training recipe on ordinary portrait videos to enable cross-identity attribute transfer without paired data. It relies on complementary masking, spatial attention fusion, mask expansion, and augmentations to separate identity and attribute streams. These are presented as an empirical training regime grounded in standard diffusion models, not as equations or derivations that reduce by construction to fitted parameters, self-citations, or renamed inputs. No load-bearing step quotes a uniqueness theorem, ansatz smuggled via prior work, or prediction that is statistically forced from the same data subset. The central claims about SOTA performance and multi-attribute composition are outcomes of the proposed method rather than tautological redefinitions.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the assumption that standard diffusion model training dynamics plus the proposed masking and fusion will separate identity and attribute signals; many typical diffusion hyperparameters (learning rate schedules, noise schedules, attention layers) are implicitly used but not enumerated as free parameters here.

free parameters (1)

masking ratios and expansion factors
Complementary masking ratios and mask expansion parameters are chosen to enforce specialization and handle misalignment; these are tuned for the self-reconstruction task.

axioms (1)

domain assumption: Diffusion models can be conditioned on multiple image references via spatial attention fusion.
Invoked when describing the Dual ReferenceNet integration inside the diffusion backbone.

pith-pipeline@v0.9.0 · 5767 in / 1329 out tokens · 40473 ms · 2026-05-18T18:32:34.124036+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: "Dual ReferenceNet that processes the two references separately and then fuses their features via spatial attention within a diffusion model... complementary masking to the reference images"

IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean · J_uniquely_calibrated_via_higher_derivative · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: "self-reconstruction formulation that leverages ordinary portrait videos... mask expansion strategy and augmentation schemes"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Vanast: Virtual Try-On with Human Image Animation via Synthetic Triplet Supervision
cs.CV 2026-04 unverdicted novelty 6.0

Vanast produces coherent garment-transferred human animation videos from a single human image, garment images, and pose guidance video using synthetic triplet supervision and a Dual Module video diffusion transformer ...

Reference graph

Works this paper leans on

25 extracted references · 25 canonical work pages · cited by 1 Pith paper · 6 internal anchors

[1] Bahri Batuhan Bilecen, Yigit Yalin, Ning Yu, and Aysegul Dundar. Reference-based 3d-aware image editing with triplanes. arXiv preprint arXiv:2404.03632.
[2] Lan Chen, Qi Mao, Yuchao Gu, and Mike Zheng Shou. Edit transfer: Learning image editing via vision in-context relations. arXiv preprint arXiv:2503.13327.
[3] Zheng Chong, Xiao Dong, Haoxiang Li, Shiyue Zhang, Wenqing Zhang, Xujie Zhang, Hanqing Zhao, Dongmei Jiang, and Xiaodan Liang. Catvton: Concatenation is all you need for virtual try-on with diffusion models. arXiv preprint arXiv:2407.15886.
[4] Zixun Fang, Wei Zhai, Aimin Su, Hongliang Song, Kai Zhu, Mao Wang, Yu Chen, Zhiheng Liu, Yang Cao, and Zheng-Jun Zha. Vivid: Video virtual try-on using diffusion models. arXiv preprint arXiv:2405.11794.
[5] Jianzhu Guo, Dingyun Zhang, Xiaoqiang Liu, Zhizhou Zhong, Yuan Zhang, Pengfei Wan, and Di Zhang. Liveportrait: Efficient portrait animation with stitching and retargeting control. arXiv preprint arXiv:2407.03168.
[6] Yuwei Guo, Ceyuan Yang, Anyi Rao, Zhengyang Liang, Yaohui Wang, Yu Qiao, Maneesh Agrawala, Dahua Lin, and Bo Dai. AnimateDiff: Animate your personalized text-to-image diffusion models without specific tuning. arXiv preprint arXiv:2307.04725.
[7] Jack Hessel, Ari Holtzman, Maxwell Forbes, Ronan Le Bras, and Yejin Choi. CLIPScore: A reference-free evaluation metric for image captioning. arXiv preprint arXiv:2104.08718.
[8] Byungjun Kim, Shunsuke Saito, Giljoo Nam, Tomas Simon, Jason Saragih, Hanbyul Joo, and Junxuan Li. Haircup: Hair compositional universal prior for 3d gaussian avatars. arXiv preprint arXiv:2507.19481.
[9] Max Ku, Cong Wei, Weiming Ren, Harry Yang, and Wenhu Chen. Anyv2v: A tuning-free framework for any video-to-video editing tasks. arXiv preprint arXiv:2403.14468.
[10] Yuhan Li, Hao Zhou, Wenxiang Shang, Ran Lin, Xuanhong Chen, and Bingbing Ni. Anyfit: Controllable virtual try-on for any combination of attire across any scenario. arXiv preprint arXiv:2405.18172.
[11] Chong Mou, Yanze Wu, Wenxu Wu, Zinan Guo, Pengze Zhang, Yufeng Cheng, Yiming Luo, Fei Ding, Shiwen Zhang, Xinghui Li, et al. Dreamo: A unified framework for image customization. arXiv preprint arXiv:2504.16915.
[12] Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. SDXL: Improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:2307.01952.
[13] Wensong Song, Hong Jiang, Zongxing Yang, Ruijie Quan, and Yi Yang. Insert anything: Image insertion via in-context editing in dit. arXiv preprint arXiv:2504.15009.
[14] Shuyuan Tu, Zhen Xing, Xintong Han, Zhi-Qi Cheng, Qi Dai, Chong Luo, and Zuxuan Wu. Stableanimator: High-quality identity-preserving human image animation. arXiv preprint arXiv:2411.17697.
[15] Yuanpeng Tu, Hao Luo, Xi Chen, Sihui Ji, Xiang Bai, and Hengshuang Zhao. VideoAnydoor: High-fidelity video object insertion with precise motion control. arXiv preprint arXiv:2501.01427.
[16] Qixun Wang, Xu Bai, Haofan Wang, Zekui Qin, Anthony Chen, Huaxia Li, Xu Tang, and Yao Hu. InstantID: Zero-shot identity-preserving generation in seconds. arXiv preprint arXiv:2401.07519.
[17] Ting-Chun Wang, Ming-Yu Liu, Jun-Yan Zhu, Guilin Liu, Andrew Tao, Jan Kautz, and Bryan Catanzaro. Video-to-video synthesis. arXiv preprint arXiv:1808.06601.
[18] Hu Ye, Jun Zhang, Sibo Liu, Xiao Han, and Wei Yang. IP-Adapter: Text compatible image prompt adapter for text-to-image diffusion models. arXiv preprint arXiv:2308.06721.
[19] Yuxuan Zhang, Lifu Wei, Qing Zhang, Yiren Song, Jiaming Liu, Huaxia Li, Xu Tang, Yao Hu, and Haibo Zhao. Stable-makeup: When real-world makeup transfer meets diffusion model. arXiv preprint arXiv:2403.07764, 2024b.
Yuxuan Zhang, Qing Zhang, Yiren Song, Jichao Zhang, Hao Tang, and Jiaming Liu. Stable-hair: Real-world hair transfer via diffusion model. In Pro...
[20] A Implementation Details. A.1 Training Details. We adopt the two-stage training strategy following Zhu et al. (2024). In the first stage, we resize all videos to a uniform resolution of 512×512 pixels and train with a global batch size of 8 for 60,000 steps. During this phase, all layers except the temporal attention layers are set to be tra...
[21] ... and VFHQ (Xie et al., 2022), ensuring that these videos contain unseen identities, facial poses, and expressions relative to the training dataset. For cross-attribute transfer, we additionally sample 50 videos. Masks required for image editing baselines are constructed following the procedures provided by the respective authors. To construct cross-attribu...
[22] ... (masked DINO) assess whether the target attribute is accurately transferred into the generated portrait animation video. To this end, we fill the background of attribute-only images with white and segment the target attribute region from the generated portrait animation video using Sapiens (Khirodkar et al., 2024). We then fill the segmented background wi...
[23] ... extends FID to the video domain. Following Fang et al. (2024), we adopt VFID to measure temporal consistency and overall video quality. A.3 Keypoint Guidance Generation. Our model generates portrait animations using a guidance video composed of facial keypoints, as shown in Fig. 2 of our main paper. These keypoints encode entangled facial shape information...
[24] ... to generate an animation of the portrait image that maintains its original shape while being driven by the motion in the guidance video. We then extract a facial keypoint guidance video from this animation using Sapiens (Khirodkar et al., 2024), effectively creating a self-reenactment-like scenario that allows our model to operate more reliably. ...
[25] Qualitative comparison of cross-attribute transfer. We extend the comparison in Fig. 4 of the main paper and present results in Fig. 10 against 12 baselines for cross-attribute transfer setup. Our method best preserves the identity of the portrait image while most accurately transferring the hairstyle from the attribute image. Furthermore, our results are...

[1] [1]

Reference-based 3d-aware image editing with triplanes.arXiv preprint arXiv:2404.03632,

Bahri Batuhan Bilecen, Yigit Yalin, Ning Yu, and Aysegul Dundar. Reference-based 3d-aware image editing with triplanes.arXiv preprint arXiv:2404.03632,

work page arXiv

[2] [2]

Edit transfer: Learning image editing via vision in-context relations.arXiv preprint arXiv:2503.13327,

Lan Chen, Qi Mao, Yuchao Gu, and Mike Zheng Shou. Edit transfer: Learning image editing via vision in-context relations.arXiv preprint arXiv:2503.13327,

work page arXiv

[3] [3]

arXiv preprint arXiv:2407.15886 , year=

Zheng Chong, Xiao Dong, Haoxiang Li, Shiyue Zhang, Wenqing Zhang, Xujie Zhang, Hanqing Zhao, Dongmei Jiang, and Xiaodan Liang. Catvton: Concatenation is all you need for virtual try-on with diffusion models.arXiv preprint arXiv:2407.15886,

work page arXiv

[4] [4]

Vivid: Video virtual try-on using diffusion models,

Zixun Fang, Wei Zhai, Aimin Su, Hongliang Song, Kai Zhu, Mao Wang, Yu Chen, Zhiheng Liu, Yang Cao, and Zheng-Jun Zha. Vivid: Video virtual try-on using diffusion models.arXiv preprint arXiv:2405.11794,

work page arXiv

[5] [5]

Liveportrait: Efficient portrait animation with stitching and retargeting control

10 arXiv preprint Jianzhu Guo, Dingyun Zhang, Xiaoqiang Liu, Zhizhou Zhong, Yuan Zhang, Pengfei Wan, and Di Zhang. Liveportrait: Efficient portrait animation with stitching and retargeting control.arXiv preprint arXiv:2407.03168,

work page arXiv

[6] Yuwei Guo, Ceyuan Yang, Anyi Rao, Zhengyang Liang, Yaohui Wang, Yu Qiao, Maneesh Agrawala, Dahua Lin, and Bo Dai. Animatediff: Animate your personalized text-to-image diffusion models without specific tuning. arXiv preprint arXiv:2307.04725.

[7] Jack Hessel, Ari Holtzman, Maxwell Forbes, Ronan Le Bras, and Yejin Choi. Clipscore: A reference-free evaluation metric for image captioning. arXiv preprint arXiv:2104.08718.

[8] Byungjun Kim, Shunsuke Saito, Giljoo Nam, Tomas Simon, Jason Saragih, Hanbyul Joo, and Junxuan Li. Haircup: Hair compositional universal prior for 3d gaussian avatars. arXiv preprint arXiv:2507.19481.

[9] Max Ku, Cong Wei, Weiming Ren, Harry Yang, and Wenhu Chen. Anyv2v: A tuning-free framework for any video-to-video editing tasks. arXiv preprint arXiv:2403.14468.

[10] Yuhan Li, Hao Zhou, Wenxiang Shang, Ran Lin, Xuanhong Chen, and Bingbing Ni. Anyfit: Controllable virtual try-on for any combination of attire across any scenario. arXiv preprint arXiv:2405.18172.

[11] Chong Mou, Yanze Wu, Wenxu Wu, Zinan Guo, Pengze Zhang, Yufeng Cheng, Yiming Luo, Fei Ding, Shiwen Zhang, Xinghui Li, et al. Dreamo: A unified framework for image customization. arXiv preprint arXiv:2504.16915.

[12] Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. Sdxl: Improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:2307.01952.

[13] Wensong Song, Hong Jiang, Zongxing Yang, Ruijie Quan, and Yi Yang. Insert anything: Image insertion via in-context editing in dit. arXiv preprint arXiv:2504.15009.

[14] Shuyuan Tu, Zhen Xing, Xintong Han, Zhi-Qi Cheng, Qi Dai, Chong Luo, and Zuxuan Wu. Stableanimator: High-quality identity-preserving human image animation. arXiv preprint arXiv:2411.17697.

[15] Yuanpeng Tu, Hao Luo, Xi Chen, Sihui Ji, Xiang Bai, and Hengshuang Zhao. Videoanydoor: High-fidelity video object insertion with precise motion control. arXiv preprint arXiv:2501.01427.

[16] Qixun Wang, Xu Bai, Haofan Wang, Zekui Qin, Anthony Chen, Huaxia Li, Xu Tang, and Yao Hu. Instantid: Zero-shot identity-preserving generation in seconds. arXiv preprint arXiv:2401.07519.

[17] Ting-Chun Wang, Ming-Yu Liu, Jun-Yan Zhu, Guilin Liu, Andrew Tao, Jan Kautz, and Bryan Catanzaro. Video-to-video synthesis. arXiv preprint arXiv:1808.06601.

[18] Hu Ye, Jun Zhang, Sibo Liu, Xiao Han, and Wei Yang. Ip-adapter: Text compatible image prompt adapter for text-to-image diffusion models. arXiv preprint arXiv:2308.06721.

[19] Yuxuan Zhang, Lifu Wei, Qing Zhang, Yiren Song, Jiaming Liu, Huaxia Li, Xu Tang, Yao Hu, and Haibo Zhao. Stable-makeup: When real-world makeup transfer meets diffusion model. arXiv preprint arXiv:2403.07764, 2024b.

Yuxuan Zhang, Qing Zhang, Yiren Song, Jichao Zhang, Hao Tang, and Jiaming Liu. Stable-hair: Real-world hair transfer via diffusion model. In Pro...

[20] A IMPLEMENTATION DETAILS

A.1 TRAINING DETAILS

We adopt the two-stage training strategy following Zhu et al. (2024). In the first stage, we resize all videos to a uniform resolution of 512×512 pixels and train with a global batch size of 8 for 60,000 steps. During this phase, all layers except the temporal attention layers are set to be tra...

[21] and VFHQ (Xie et al., 2022), ensuring that these videos contain unseen identities, facial poses, and expressions relative to the training dataset. For cross-attribute transfer, we additionally sample 50 videos. Masks required for image editing baselines are constructed following the procedures provided by the respective authors. To construct cross-attribu...

[22] (masked DINO) assess whether the target attribute is accurately transferred into the generated portrait animation video. To this end, we fill the background of attribute-only images with white and segment the target attribute region from the generated portrait animation video using Sapiens (Khirodkar et al., 2024). We then fill the segmented background wi...
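The masking step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' evaluation code: the function names are hypothetical, the score is assumed to be a cosine similarity between image embeddings (the excerpt is truncated before it says how features are compared), and `toy_embed` is a stand-in for a real DINO feature extractor.

```python
import numpy as np


def fill_background_white(image: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Keep pixels where mask is True and paint everything else white.

    image: (H, W, 3) uint8 RGB array.
    mask:  (H, W) boolean array, True on the attribute region.
    """
    white = np.full_like(image, 255)
    return np.where(mask[..., None], image, white)


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two 1-D feature vectors (0.0 if either is zero)."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(np.dot(a, b) / denom) if denom > 0 else 0.0


def masked_similarity(ref_img, ref_mask, gen_img, gen_mask, embed) -> float:
    """Embed both white-filled images and compare them.

    `embed` is any callable mapping an image to a 1-D feature vector;
    using cosine similarity here is an assumption, not taken from the paper.
    """
    ref = fill_background_white(ref_img, ref_mask)
    gen = fill_background_white(gen_img, gen_mask)
    return cosine_similarity(embed(ref), embed(gen))


def toy_embed(img: np.ndarray) -> np.ndarray:
    """Hypothetical placeholder for a DINO encoder: mean color per channel."""
    return img.reshape(-1, 3).astype(np.float64).mean(axis=0)
```

In practice `embed` would wrap a pretrained DINO model and the masks would come from a segmentation model such as Sapiens; averaging the per-frame scores over a video is one plausible way to obtain a single number.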

[23] extends FID to the video domain. Following Fang et al. (2024), we adopt VFID to measure temporal consistency and overall video quality.

A.3 KEYPOINT GUIDANCE GENERATION

Our model generates portrait animations using a guidance video composed of facial keypoints, as shown in Fig. 2 of our main paper. These keypoints encode entangled facial shape information...
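For context on the VFID metric mentioned above, the standard Fréchet Inception Distance that it extends fits a Gaussian to the features of real samples and another to the features of generated samples, then measures the distance between the two. The symbols below are the usual ones, not notation taken from this paper, and the excerpt does not say which video feature extractor is used for VFID.

```latex
\mathrm{FID}
  = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \right)
```

Here $\mu_r, \Sigma_r$ and $\mu_g, \Sigma_g$ are the mean and covariance of the real and generated feature sets. A video variant computes the same quantity on features from a video network rather than a per-image one, which is why it is also sensitive to temporal consistency.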

[24] to generate an animation of the portrait image that maintains its original shape while being driven by the motion in the guidance video. We then extract a facial keypoint guidance video from this animation using Sapiens (Khirodkar et al., 2024), effectively creating a self-reenactment-like scenario that allows our model to operate more reliably. ...
