Durian: Dual Reference Image-Guided Portrait Animation with Attribute Transfer
Pith reviewed 2026-05-18 18:32 UTC · model grok-4.3
The pith
Durian learns cross-identity attribute transfer for portrait animation by self-reconstructing ordinary videos with a dual-reference diffusion model.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Durian shows that a self-reconstruction objective on ordinary portrait videos suffices to learn attribute transfer across identities. Two frames from the same video serve as an identity reference and an attribute reference; complementary masking keeps their roles distinct while a Dual ReferenceNet processes them separately and fuses their features via spatial attention inside the diffusion model. The model is trained to reconstruct the original frame, thereby acquiring the capacity for cross-identity transfer. Mask expansion and targeted augmentations close the gap between this training regime and real cross-identity inference, yielding robust performance on attributes of varying spatial extent and misalignment.
What carries the argument
The Dual ReferenceNet that processes identity and attribute references separately before fusing their features via spatial attention inside the diffusion model, guided by complementary masking.
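To make the complementary-masking idea concrete, here is a minimal sketch in plain Python. It assumes (this is an interpretation, not something the summary states) that the identity reference has the attribute region blanked out while the attribute reference keeps only the attribute region. The function names `apply_mask` and `make_pseudo_pair`, the toy single-channel images, and the zero fill value are all hypothetical.

```python
def apply_mask(image, mask, keep):
    """Keep pixels whose mask value equals `keep`; set every other pixel to 0."""
    return [
        [pixel if m == keep else 0 for pixel, m in zip(img_row, mask_row)]
        for img_row, mask_row in zip(image, mask)
    ]


def make_pseudo_pair(frame_a, mask_a, frame_b, mask_b):
    """Build the two masked references from two frames of the same video.

    Assumption: mask value 1 marks the attribute region (for example hair).
    """
    # Identity stream: drop the attribute region, keep everything else.
    identity_ref = apply_mask(frame_a, mask_a, keep=0)
    # Attribute stream: keep only the attribute region.
    attribute_ref = apply_mask(frame_b, mask_b, keep=1)
    return identity_ref, attribute_ref


# Toy 2x2 single-channel frames with their attribute masks.
frame_a, mask_a = [[1, 2], [3, 4]], [[1, 0], [0, 0]]
frame_b, mask_b = [[5, 6], [7, 8]], [[1, 1], [0, 0]]
identity_ref, attribute_ref = make_pseudo_pair(frame_a, mask_a, frame_b, mask_b)
```

Because each stream sees only its own region, neither reference alone contains enough information to reconstruct the frame, which is the intuition behind keeping the roles distinct.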
If this is right
- The method achieves state-of-the-art results on portrait animation tasks that include attribute transfer.
- Its dual-reference design supports composing attributes from multiple references in one generation pass.
- Smooth attribute interpolation becomes possible within a single generation pass.
- Mask expansion and augmentations make transfer robust to variations in spatial extent and misalignment.
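The mask expansion mentioned in the last item can be pictured as a morphological dilation of the attribute mask. The sketch below is an illustration under that assumption only; the page does not say how the paper expands masks, and `expand_mask` and its `radius` parameter are hypothetical names.

```python
def expand_mask(mask, radius=1):
    """Dilate a binary mask with a square window of the given radius."""
    height, width = len(mask), len(mask[0])
    expanded = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            if not mask[y][x]:
                continue
            # Mark every pixel inside the window around an active pixel.
            for yy in range(max(0, y - radius), min(height, y + radius + 1)):
                for xx in range(max(0, x - radius), min(width, x + radius + 1)):
                    expanded[yy][xx] = 1
    return expanded


# A 5x5 mask with a single active pixel in the centre.
mask = [[0] * 5 for _ in range(5)]
mask[2][2] = 1
expanded = expand_mask(mask, radius=1)
```

A looser mask gives the model room to place an attribute whose extent differs from the one in the target portrait, which is the kind of train-inference gap the bullet refers to.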
Where Pith is reading between the lines
- The same self-reconstruction pattern could be applied to other conditional video tasks where paired examples are scarce.
- Interpolation between attributes may give creators gradual control over generated expressions or styles.
- Extending the dual-stream design might improve fine-grained controllability in related face-editing applications.
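One simple way to picture the attribute interpolation discussed above is a linear blend between two attribute feature vectors. This is purely an illustrative assumption: the page does not say how Durian interpolates, and `interpolate_features` and `alpha` are hypothetical names.

```python
def interpolate_features(feat_a, feat_b, alpha):
    """Linearly blend two equal-length feature vectors.

    alpha = 0 returns feat_a, alpha = 1 returns feat_b.
    """
    if len(feat_a) != len(feat_b):
        raise ValueError("feature vectors must have the same length")
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return [(1.0 - alpha) * a + alpha * b for a, b in zip(feat_a, feat_b)]


halfway = interpolate_features([0.0, 0.0], [2.0, 4.0], alpha=0.5)
```

Sweeping `alpha` from 0 to 1 would, under this assumption, move the conditioning gradually from one attribute reference to the other.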
Load-bearing premise
Ordinary video frames can serve as pseudo-pairs that let complementary masking and dual feature fusion teach the model to separate identity information from attribute information.
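The pseudo-pair premise amounts to drawing two different frames from one clip and assigning them the two roles. The sketch below shows only that sampling step; `sample_pseudo_pair` is a hypothetical helper, and uniform sampling without replacement is an assumption rather than the paper's stated procedure.

```python
import random


def sample_pseudo_pair(num_frames, rng):
    """Pick two distinct frame indices from one video.

    Returns (identity_index, attribute_index).
    """
    if num_frames < 2:
        raise ValueError("need at least two frames to form a pseudo pair")
    identity_index, attribute_index = rng.sample(range(num_frames), 2)
    return identity_index, attribute_index


rng = random.Random(0)  # fixed seed so the sketch is repeatable
identity_index, attribute_index = sample_pseudo_pair(120, rng)
```

Since both frames show the same person, the supervision is self-reconstruction; whether the learned streams generalise across identities is exactly what the premise leaves open.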
What would settle it
Generate animations from cross-identity references that contain large un-augmented pose or lighting differences and check whether identity preservation and attribute accuracy both remain high.
Original abstract
We present Durian, the first method for generating portrait animation videos with cross-identity attribute transfer from one or more reference images to a target portrait. Training such models typically requires attribute pairs of the same individual, which are rarely available at scale. To address this challenge, we propose a self-reconstruction formulation that leverages ordinary portrait videos to learn attribute transfer without explicit paired data. Two frames from the same video act as a pseudo pair: one serves as an attribute reference and the other as an identity reference. To enable this self-reconstruction training, we introduce a Dual ReferenceNet that processes the two references separately and then fuses their features via spatial attention within a diffusion model. To make sure each reference functions as a specialized stream for either identity or attribute information, we apply complementary masking to the reference images. Together, these two components guide the model to reconstruct the original video, naturally learning cross-identity attribute transfer. To bridge the gap between self-reconstruction training and cross-identity inference, we introduce a mask expansion strategy and augmentation schemes, enabling robust transfer of attributes with varying spatial extent and misalignment. Durian achieves state-of-the-art performance on portrait animation with attribute transfer. Moreover, its dual reference design uniquely supports multi-attribute composition and smooth attribute interpolation within a single generation pass, enabling highly flexible and controllable synthesis.
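As a rough illustration of fusing two reference streams via spatial attention, the sketch below lets each denoising token attend over its own tokens plus the tokens of both references. It assumes a ReferenceNet-style concatenation of keys and values and omits learned projections; none of this is confirmed by the abstract, and `fuse_with_spatial_attention` is a hypothetical name.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_with_spatial_attention(denoise_tokens, identity_tokens, attribute_tokens):
    """Each denoising token attends over itself and both reference streams."""
    # Assumption: keys and values are the plain concatenation of all tokens.
    context = denoise_tokens + identity_tokens + attribute_tokens
    dim = len(denoise_tokens[0])
    scale = 1.0 / math.sqrt(dim)
    fused = []
    for query in denoise_tokens:
        scores = [scale * sum(q * k for q, k in zip(query, key)) for key in context]
        weights = softmax(scores)
        fused.append(
            [sum(w * value[d] for w, value in zip(weights, context)) for d in range(dim)]
        )
    return fused


fused = fuse_with_spatial_attention(
    denoise_tokens=[[1.0, 0.0], [0.0, 1.0]],
    identity_tokens=[[1.0, 0.0]],
    attribute_tokens=[[0.0, 1.0]],
)
```

The output has one fused vector per denoising token, so the references influence generation without changing the spatial layout of the latent.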
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents Durian, a method for portrait animation videos enabling cross-identity attribute transfer from one or more reference images to a target portrait. It introduces a self-reconstruction training scheme on ordinary portrait videos (using two frames as pseudo-pairs), a Dual ReferenceNet that processes references separately and fuses features via spatial attention in a diffusion model, complementary masking to specialize identity versus attribute streams, plus mask expansion and augmentations to close the train-inference gap. The work claims state-of-the-art performance on portrait animation with attribute transfer and unique support for multi-attribute composition and smooth interpolation in a single generation pass.
Significance. If the empirical claims hold, the self-reconstruction approach could reduce reliance on scarce paired attribute data, enabling more scalable training for controllable portrait video synthesis. The dual-reference architecture's support for composition and interpolation within one pass offers a practical advantage for flexible editing applications.
major comments (2)
- [Abstract] Abstract: the SOTA performance claim and the assertion that the dual reference design 'uniquely supports multi-attribute composition' are load-bearing for the central contribution, yet the provided description contains no quantitative tables, specific metrics, baseline comparisons, or ablation results to substantiate them.
- [Abstract] Training formulation (self-reconstruction objective): the claim that complementary masking plus dual feature fusion successfully separates identity and attribute information for cross-identity generalization rests on the unverified assumption that the model learns transferable streams rather than copying or leaking features from same-video frames; without explicit loss equations, separation metrics, or cross-identity error analysis this remains a potential gap.
minor comments (2)
- [Abstract] Abstract: the description of mask expansion and augmentation schemes could specify the exact expansion factors and augmentation parameters used, as these are listed among the free parameters.
- The manuscript would benefit from a dedicated limitations paragraph addressing failure cases such as extreme pose misalignment or attribute types with large spatial extent.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each major comment point by point below, providing clarifications based on the full paper content and indicating where revisions will be incorporated to strengthen the presentation.
- Referee: [Abstract] Abstract: the SOTA performance claim and the assertion that the dual reference design 'uniquely supports multi-attribute composition' are load-bearing for the central contribution, yet the provided description contains no quantitative tables, specific metrics, baseline comparisons, or ablation results to substantiate them.
  Authors: We agree that the abstract, being a high-level summary, does not contain the detailed quantitative evidence. The full manuscript substantiates the SOTA claims with tables in the Experiments section comparing against recent baselines on metrics including FID, FVD, and user preference scores, along with ablations on the dual-reference components. The unique support for multi-attribute composition and interpolation is shown via qualitative results and a dedicated subsection demonstrating single-pass generation. To make the abstract more self-contained while remaining concise, we will revise it to include brief references to key quantitative gains and the demonstrated capabilities.
  Revision: yes
- Referee: [Abstract] Training formulation (self-reconstruction objective): the claim that complementary masking plus dual feature fusion successfully separates identity and attribute information for cross-identity generalization rests on the unverified assumption that the model learns transferable streams rather than copying or leaking features from same-video frames; without explicit loss equations, separation metrics, or cross-identity error analysis this remains a potential gap.
  Authors: This is a valid concern about potential intra-video leakage versus true cross-identity transfer. The manuscript presents the self-reconstruction objective, Dual ReferenceNet with spatial attention fusion, and complementary masking strategy in Section 3, along with the training loss formulation. Cross-identity generalization is evaluated on held-out identities in the results section. To further address the separation assumption, we will add a short analysis with stream-specific feature similarity metrics in the revised manuscript.
  Revision: partial
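The rebuttal does not define its "stream-specific feature similarity metrics". One plausible reading, offered here only as an assumption, is the mean cosine similarity between paired identity-stream and attribute-stream features, where lower values would suggest less leakage between streams. `cosine_similarity` and `mean_cross_stream_similarity` are hypothetical helpers.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def mean_cross_stream_similarity(identity_feats, attribute_feats):
    """Average cosine similarity over paired features from the two streams."""
    if len(identity_feats) != len(attribute_feats) or not identity_feats:
        raise ValueError("need equally many, non-empty feature lists")
    sims = [cosine_similarity(i, a) for i, a in zip(identity_feats, attribute_feats)]
    return sum(sims) / len(sims)


score = mean_cross_stream_similarity(
    [[1.0, 0.0], [1.0, 0.0]],
    [[1.0, 0.0], [0.0, 1.0]],
)
```

A score like this would only be indicative; it says nothing about whether the features that do differ are the ones that matter for identity or attribute fidelity.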
Circularity Check
No significant circularity detected
The paper introduces a new Dual ReferenceNet architecture and self-reconstruction training recipe on ordinary portrait videos to enable cross-identity attribute transfer without paired data. It relies on complementary masking, spatial attention fusion, mask expansion, and augmentations to separate identity and attribute streams. These are presented as an empirical training regime grounded in standard diffusion models, not as equations or derivations that reduce by construction to fitted parameters, self-citations, or renamed inputs. No load-bearing step quotes a uniqueness theorem, ansatz smuggled via prior work, or prediction that is statistically forced from the same data subset. The central claims about SOTA performance and multi-attribute composition are outcomes of the proposed method rather than tautological redefinitions.
Axiom & Free-Parameter Ledger
free parameters (1)
- masking ratios and expansion factors
axioms (1)
- domain assumption: Diffusion models can be conditioned on multiple image references via spatial attention fusion.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, theorem washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Passage: Dual ReferenceNet that processes the two references separately and then fuses their features via spatial attention within a diffusion model... complementary masking to the reference images
- IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean, theorem J_uniquely_calibrated_via_higher_derivative (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Passage: self-reconstruction formulation that leverages ordinary portrait videos... mask expansion strategy and augmentation schemes
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 1 Pith paper
- Vanast: Virtual Try-On with Human Image Animation via Synthetic Triplet Supervision
  Vanast produces coherent garment-transferred human animation videos from a single human image, garment images, and pose guidance video using synthetic triplet supervision and a Dual Module video diffusion transformer ...
Reference graph
Works this paper leans on
- [1] Bahri Batuhan Bilecen, Yigit Yalin, Ning Yu, and Aysegul Dundar. Reference-based 3d-aware image editing with triplanes. arXiv preprint arXiv:2404.03632.
- [2] Lan Chen, Qi Mao, Yuchao Gu, and Mike Zheng Shou. Edit transfer: Learning image editing via vision in-context relations. arXiv preprint arXiv:2503.13327.
- [3] Zheng Chong, Xiao Dong, Haoxiang Li, Shiyue Zhang, Wenqing Zhang, Xujie Zhang, Hanqing Zhao, Dongmei Jiang, and Xiaodan Liang. Catvton: Concatenation is all you need for virtual try-on with diffusion models. arXiv preprint arXiv:2407.15886.
- [4] Zixun Fang, Wei Zhai, Aimin Su, Hongliang Song, Kai Zhu, Mao Wang, Yu Chen, Zhiheng Liu, Yang Cao, and Zheng-Jun Zha. Vivid: Video virtual try-on using diffusion models. arXiv preprint arXiv:2405.11794.
- [5] Jianzhu Guo, Dingyun Zhang, Xiaoqiang Liu, Zhizhou Zhong, Yuan Zhang, Pengfei Wan, and Di Zhang. Liveportrait: Efficient portrait animation with stitching and retargeting control. arXiv preprint arXiv:2407.03168.
- [6] Yuwei Guo, Ceyuan Yang, Anyi Rao, Zhengyang Liang, Yaohui Wang, Yu Qiao, Maneesh Agrawala, Dahua Lin, and Bo Dai. Animatediff: Animate your personalized text-to-image diffusion models without specific tuning. arXiv preprint arXiv:2307.04725.
- [7] Jack Hessel, Ari Holtzman, Maxwell Forbes, Ronan Le Bras, and Yejin Choi. Clipscore: A reference-free evaluation metric for image captioning. arXiv preprint arXiv:2104.08718.
- [8] Byungjun Kim, Shunsuke Saito, Giljoo Nam, Tomas Simon, Jason Saragih, Hanbyul Joo, and Junxuan Li. Haircup: Hair compositional universal prior for 3d gaussian avatars. arXiv preprint arXiv:2507.19481.
- [9] Max Ku, Cong Wei, Weiming Ren, Harry Yang, and Wenhu Chen. Anyv2v: A tuning-free framework for any video-to-video editing tasks. arXiv preprint arXiv:2403.14468.
- [10] Yuhan Li, Hao Zhou, Wenxiang Shang, Ran Lin, Xuanhong Chen, and Bingbing Ni. Anyfit: Controllable virtual try-on for any combination of attire across any scenario. arXiv preprint arXiv:2405.18172.
- [11] Chong Mou, Yanze Wu, Wenxu Wu, Zinan Guo, Pengze Zhang, Yufeng Cheng, Yiming Luo, Fei Ding, Shiwen Zhang, Xinghui Li, et al. Dreamo: A unified framework for image customization. arXiv preprint arXiv:2504.16915.
- [12] Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. Sdxl: Improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:2307.01952.
- [13] Wensong Song, Hong Jiang, Zongxing Yang, Ruijie Quan, and Yi Yang. Insert anything: Image insertion via in-context editing in dit. arXiv preprint arXiv:2504.15009.
- [14] Shuyuan Tu, Zhen Xing, Xintong Han, Zhi-Qi Cheng, Qi Dai, Chong Luo, and Zuxuan Wu. Stableanimator: High-quality identity-preserving human image animation. arXiv preprint arXiv:2411.17697.
- [15] Yuanpeng Tu, Hao Luo, Xi Chen, Sihui Ji, Xiang Bai, and Hengshuang Zhao. Videoanydoor: High-fidelity video object insertion with precise motion control. arXiv preprint arXiv:2501.01427.
- [16] Qixun Wang, Xu Bai, Haofan Wang, Zekui Qin, Anthony Chen, Huaxia Li, Xu Tang, and Yao Hu. Instantid: Zero-shot identity-preserving generation in seconds. arXiv preprint arXiv:2401.07519.
- [17] Ting-Chun Wang, Ming-Yu Liu, Jun-Yan Zhu, Guilin Liu, Andrew Tao, Jan Kautz, and Bryan Catanzaro. Video-to-video synthesis. arXiv preprint arXiv:1808.06601.
- [18] Hu Ye, Jun Zhang, Sibo Liu, Xiao Han, and Wei Yang. Ip-adapter: Text compatible image prompt adapter for text-to-image diffusion models. arXiv preprint arXiv:2308.06721.
- [19] Yuxuan Zhang, Lifu Wei, Qing Zhang, Yiren Song, Jiaming Liu, Huaxia Li, Xu Tang, Yao Hu, and Haibo Zhao. Stable-makeup: When real-world makeup transfer meets diffusion model. arXiv preprint arXiv:2403.07764, 2024b. Yuxuan Zhang, Qing Zhang, Yiren Song, Jichao Zhang, Hao Tang, and Jiaming Liu. Stable-hair: Real-world hair transfer via diffusion model. InPro...
- [20] A IMPLEMENTATION DETAILS. A.1 TRAINING DETAILS. We adopt the two-stage training strategy following Zhu et al. (2024). In the first stage, we resize all videos to a uniform resolution of 512×512 pixels and train with a global batch size of 8 for 60,000 steps. During this phase, all layers except the temporal attention layers are set to be tra...
- [21] and VFHQ (Xie et al., 2022), ensuring that these videos contain unseen identities, facial poses, and expressions relative to the training dataset. For cross-attribute transfer, we additionally sample 50 videos. Masks required for image editing baselines are constructed following the procedures provided by the respective authors. To construct cross-attribu...
- [22] (masked DINO) assess whether the target attribute is accurately transferred into the generated portrait animation video. To this end, we fill the background of attribute-only images with white and segment the target attribute region from the generated portrait animation video using Sapiens (Khirodkar et al., 2024). We then fill the segmented background wi...
- [23] extends FID to the video domain. Following Fang et al. (2024), we adopt VFID to measure temporal consistency and overall video quality. A.3 KEYPOINT GUIDANCE GENERATION. Our model generates portrait animations using a guidance video composed of facial keypoints, as shown in Fig. 2 of our main paper. These keypoints encode entangled facial shape information...
- [24] to generate an animation of the portrait image that maintains its original shape while being driven by the motion in the guidance video. We then extract a facial keypoint guidance video from this animation using Sapiens (Khirodkar et al., 2024), effectively creating a self-reenactment-like scenario that allows our model to operate more reliably. ...
- [25] Qualitative comparison of cross-attribute transfer. We extend the comparison in Fig. 4 of the main paper and present results in Fig. 10 against 12 baselines for cross-attribute transfer setup. Our method best preserves the identity of the portrait image while most accurately transferring the hairstyle from the attribute image. Furthermore, our results are...