pith. machine review for the scientific record. sign in

arxiv: 2512.07951 · v2 · submitted 2025-12-08 · 💻 cs.CV

Recognition: unknown

Preserving Source Video Realism: High-Fidelity Face Swapping for Cinematic Quality

Authors on Pith no claims yet

Pith reviewed 2026-05-16 23:53 UTC · model grok-4.3

classification 💻 cs.CV
keywords face swappingvideo face swappingtemporal consistencykeyframe conditioningreference-guided editingidentity preservationcinematic video
0
0 comments X

The pith

LivingSwap conditions video face swaps on keyframes and source reference to preserve expressions, lighting, and motion across long sequences.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Video face swapping in film production demands that a new identity replace the original face without disrupting the source clip's timing, expressions, lighting, or camera motion. The paper introduces LivingSwap, which selects keyframes from the source video to guide identity injection while using the full source video as a reference signal for temporal stitching between frames. This combination allows the model to maintain coherence over extended, complex sequences where earlier methods introduce flickering or drift. To train the system, the authors built the Face2Face paired dataset and reversed the pairs to create reliable ground-truth supervision. Experiments show the resulting swaps integrate the target face more naturally into the source video's dynamics than prior techniques.

Core claim

LivingSwap is the first video reference-guided face swapping model that employs keyframes as conditioning signals to inject the target identity and performs temporal stitching by combining keyframe conditioning with video reference guidance, thereby ensuring stable identity preservation and high-fidelity reconstruction of expressions, lighting, and motion across long video sequences.

What carries the argument

Keyframe conditioning signals combined with video reference guidance for temporal stitching.

If this is right

  • Target identity integrates with source expressions, lighting, and motion without manual cleanup
  • Temporal coherence holds across extended video sequences
  • Production workflows require substantially less manual intervention for face replacement
  • State-of-the-art fidelity on reference-guided video face swapping benchmarks

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same keyframe-plus-reference pattern could be tested on other reference-guided video tasks such as object insertion or style transfer
  • If the reversed-pair construction proves robust, similar dataset reversal might improve supervision for other paired video editing problems
  • Deployment on longer films would require checking whether drift accumulates beyond the lengths seen in Face2Face training clips

Load-bearing premise

The Face2Face dataset and its reversed pairs supply unbiased ground-truth supervision that generalizes to arbitrary long, complex cinematic videos without artifacts or identity leakage.

What would settle it

Visible identity leakage, loss of source lighting consistency, or temporal artifacts in a long real-world cinematic sequence whose motion and lighting patterns fall outside the distribution of the reversed Face2Face pairs.

Figures

Figures reproduced from arXiv: 2512.07951 by Chenchen Jing, Chunhua Shen, Hao Chen, Hao Zhong, Muzhi Zhu, Wen Wang, Yuling Xi, Zekai Luo, Zhouhang Zhu, Zongze Du.

Figure 1
Figure 1. Figure 1: Qualitative results of our proposed video reference guided face swapping model, L [PITH_FULL_IMAGE:figures/full_fig_p001_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: (a) GAN-based approaches process videos in a frame-by-frame manner, and therefore often struggle with realism and suffer from [PITH_FULL_IMAGE:figures/full_fig_p002_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Overview of the proposed LIVINGSWAP framework for video face swapping. (1) Keyframes are used as temporal anchors to ensure consistent identity injection across long sequences. (2) We feed the source video as a reference, enabling high-fidelity reconstruction of non-identity attributes such as lighting and expressions. (3) By sequentially generating chunks and propagating the final frame of the previous ch… view at source ↗
Figure 4
Figure 4. Figure 4: Qualitative comparison with state-of-the-art face-swapping methods. L [PITH_FULL_IMAGE:figures/full_fig_p006_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Visualization of the Face2Face dataset. The central plot shows the distribution of identity similarity scores between each swapped video and its corresponding original video, with the lowest 30% (red) and highest 30% (blue) highlighted. Low-similarity pairs often contain artifacts and distortions as significant identity discrepancies (left), while high-similarity pairs may contain failed swap frames, causi… view at source ↗
Figure 6
Figure 6. Figure 6: Qualitative comparison between the data pairs in [PITH_FULL_IMAGE:figures/full_fig_p012_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Keyframes Identity Injection for resolving Accumulated ID Errors. Keyframe Identity Injection for Resolving Accumulated ID [PITH_FULL_IMAGE:figures/full_fig_p013_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Qualitative comparison of using different image-level face swapping models as Per-frame Edit module. Injected keyframes often [PITH_FULL_IMAGE:figures/full_fig_p014_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Identity swapping results on the same source video with different target identities. Our method produces consistent and high [PITH_FULL_IMAGE:figures/full_fig_p015_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Face swapping results on diverse source videos with the same target identity. Our method consistently preserves target identity [PITH_FULL_IMAGE:figures/full_fig_p016_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: Grayscale keyframe guidance. To avoid incorrect color propagation from imperfect edited keyframes, we modify Video Ref￾erence Completion module and convert each keyframe to a grayscale image before VAE encoding. This preserves structural cues (identity, pose, shading) while removing misleading chromatic information, allowing the model to recover accurate colors from the reference video. H. CineFaceBench C… view at source ↗
Figure 12
Figure 12. Figure 12: Compared with the original LIVINGSWAP, using grayscale keyframes effectively suppresses color bleeding (e.g., the blue tint near the ear in the first example) and reduces temporal flickering artifacts (e.g., the dark patches on the head in the second example), leading to more stable and faithful video face swapping results. constructed a film scene face-swapping benchmark, Cine￾FaceBench. CineFaceBench co… view at source ↗
Figure 13
Figure 13. Figure 13: Additional Qualitative Comparison of Different Methods on [PITH_FULL_IMAGE:figures/full_fig_p018_13.png] view at source ↗
Figure 14
Figure 14. Figure 14: Qualitative comparison with recent inpainting-based video face swapping methods [ [PITH_FULL_IMAGE:figures/full_fig_p019_14.png] view at source ↗
read the original abstract

Video face swapping is crucial in film and entertainment production, where achieving high fidelity and temporal consistency over long and complex video sequences remains a significant challenge. Inspired by recent advances in reference-guided image editing, we explore whether rich visual attributes from source videos can be similarly leveraged to enhance both fidelity and temporal coherence in video face swapping. Building on this insight, this work presents LivingSwap, the first video reference guided face swapping model. Our approach employs keyframes as conditioning signals to inject the target identity, enabling flexible and controllable editing. By combining keyframe conditioning with video reference guidance, the model performs temporal stitching to ensure stable identity preservation and high-fidelity reconstruction across long video sequences. To address the scarcity of data for reference-guided training, we construct a paired face-swapping dataset, Face2Face, and further reverse the data pairs to ensure reliable ground-truth supervision. Extensive experiments demonstrate that our method achieves state-of-the-art results, seamlessly integrating the target identity with the source video's expressions, lighting, and motion, while significantly reducing manual effort in production workflows. Project webpage: https://aim-uofa.github.io/LivingSwap

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces LivingSwap, the first video reference-guided face swapping model. It uses keyframes as conditioning signals to inject target identity, combines this with video reference guidance, and applies temporal stitching to maintain identity preservation and high-fidelity reconstruction across long video sequences. To enable training, the authors construct a paired Face2Face dataset and reverse the data pairs for ground-truth supervision. The central claim is that the method achieves state-of-the-art results by seamlessly integrating target identity with source expressions, lighting, and motion while reducing manual effort in cinematic production workflows.

Significance. If the performance claims hold under rigorous evaluation, the work could advance practical video face swapping for film and entertainment by improving temporal consistency and realism in complex sequences. The reference-guided approach adapted from image editing to video is a reasonable direction for controllable editing. However, the current lack of quantitative metrics, baselines, and generalization tests makes it difficult to determine whether the contribution meaningfully exceeds existing methods.

major comments (3)
  1. [Abstract] Abstract: The state-of-the-art claim is unsupported by any quantitative metrics, baseline comparisons, ablation studies, or error analysis, which are required to substantiate performance assertions in a computer vision manuscript on face swapping.
  2. [Dataset Construction] Dataset Construction: Reversing pairs in the newly constructed Face2Face dataset to create supervision may embed source-specific correlations in lighting, motion statistics, and identity cues rather than producing clean target swaps; this risks overfitting in the keyframe conditioning and temporal stitching modules and undermines generalization claims to arbitrary long cinematic sequences.
  3. [Method] Method: The keyframe selection and conditioning strength are listed as free parameters without a clear protocol for their determination or robustness analysis across diverse video lengths and complexities, leaving the central controllability claim load-bearing but unverified.
minor comments (2)
  1. [Abstract] Abstract: The project webpage is referenced but no information on code or model availability is provided, which would aid reproducibility.
  2. [Introduction] The manuscript would benefit from additional citations to recent video face swapping and temporal consistency methods to better position the novelty of the reference-guided temporal stitching.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment point-by-point below, indicating where revisions will be made to strengthen the manuscript.

read point-by-point responses
  1. Referee: [Abstract] Abstract: The state-of-the-art claim is unsupported by any quantitative metrics, baseline comparisons, ablation studies, or error analysis, which are required to substantiate performance assertions in a computer vision manuscript on face swapping.

    Authors: We acknowledge that the SOTA claim requires stronger quantitative backing. The current manuscript prioritizes qualitative visual results and temporal consistency demonstrations across long sequences, but we agree this is insufficient. In the revised version we will add quantitative metrics (e.g., FID, LPIPS, and a temporal consistency score), direct comparisons against recent baselines, and ablation studies on the keyframe conditioning and temporal stitching modules. revision: yes

  2. Referee: [Dataset Construction] Dataset Construction: Reversing pairs in the newly constructed Face2Face dataset to create supervision may embed source-specific correlations in lighting, motion statistics, and identity cues rather than producing clean target swaps; this risks overfitting in the keyframe conditioning and temporal stitching modules and undermines generalization claims to arbitrary long cinematic sequences.

    Authors: The reversal creates paired supervision by swapping the roles of source and target so the model learns identity injection while preserving the original source attributes; the Face2Face dataset was built from diverse cinematic clips to reduce spurious correlations. We accept that additional validation is warranted and will add generalization tests on out-of-distribution long sequences in the revision. revision: partial

  3. Referee: [Method] Method: The keyframe selection and conditioning strength are listed as free parameters without a clear protocol for their determination or robustness analysis across diverse video lengths and complexities, leaving the central controllability claim load-bearing but unverified.

    Authors: Keyframe selection uses a motion-threshold protocol (detailed in Section 3.2) and conditioning strength is set via validation-set tuning. We will expand the method description with an explicit protocol and add a robustness subsection reporting performance across video lengths and parameter ranges to verify controllability. revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper presents an empirical ML method for video face swapping that constructs the Face2Face dataset and reverses pairs to create supervision signals. No mathematical derivations, equations, or first-principles results are described that reduce by construction to fitted parameters or self-referential inputs. Central claims of SOTA performance rest on training the model and evaluating against external baselines rather than any self-definition, fitted-input-as-prediction, or self-citation load-bearing step. Dataset construction addresses data scarcity in a standard non-circular manner and does not force the reported outcomes.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

Central claim rests on a domain assumption about leveraging source video attributes for coherence plus standard deep-learning training assumptions; no new physical entities are postulated.

free parameters (1)
  • keyframe selection and conditioning strength
    Hyperparameters controlling which frames serve as identity injection points and their weighting during temporal stitching, chosen during model training.
axioms (1)
  • domain assumption Rich visual attributes from source videos can be leveraged via keyframe conditioning to enhance fidelity and temporal coherence in face swapping.
    Core motivating insight stated in the abstract that underpins the architectural design.

pith-pipeline@v0.9.0 · 5525 in / 1228 out tokens · 37209 ms · 2026-05-16T23:53:57.044134+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

51 extracted references · 51 canonical work pages · 9 internal anchors

  1. [1]

    L2cs-net: Fine-grained gaze estimation in unconstrained environments

    Ahmed A Abdelrahman, Thorsten Hempel, Aly Khalifa, Ay- oub Al-Hamadi, and Laslo Dinges. L2cs-net: Fine-grained gaze estimation in unconstrained environments. In2023 8th International Conference on Frontiers of Signal Processing (ICFSP), pages 98–102. IEEE, 2023. 7

  2. [2]

    Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets

    Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram V oleti, Adam Letts, et al. Stable video diffusion: Scaling latent video diffusion models to large datasets.arXiv preprint arXiv:2311.15127, 2023. 7

  3. [3]

    Masactrl: Tuning-free mu- tual self-attention control for consistent image synthesis and editing

    Mingdeng Cao, Xintao Wang, Zhongang Qi, Ying Shan, Xi- aohu Qie, and Yinqiang Zheng. Masactrl: Tuning-free mu- tual self-attention control for consistent image synthesis and editing. InProceedings of the IEEE/CVF international con- ference on computer vision, pages 22560–22570, 2023. 3

  4. [4]

    Simswap: An efficient framework for high fidelity face swapping

    Renwang Chen, Xuanhong Chen, Bingbing Ni, and Yanhao Ge. Simswap: An efficient framework for high fidelity face swapping. InProceedings of the 28th ACM International Conference on Multimedia, 2020. 1, 3, 6, 7

  5. [5]

    Hifivfs: High fidelity video face swapping

    Xu Chen, Keke He, Junwei Zhu, Yanhao Ge, Wei Li, and Chengjie Wang. Hifivfs: High fidelity video face swapping. arXiv preprint arXiv:2411.18293, 2024. 2, 3, 6, 7, 9

  6. [6]

    Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

    Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blis- tein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities.arXiv preprint arXiv:2507.06261, 2025. 3

  7. [7]

    https://github.com/deepfakes/face- swap.Ac- cessed: 2020-12-20, 2020

    DeepFakes. https://github.com/deepfakes/face- swap.Ac- cessed: 2020-12-20, 2020. 1, 3

  8. [8]

    Emerging Properties in Unified Multimodal Pretraining

    Chaorui Deng, Deyao Zhu, Kunchang Li, Chenhui Gou, Feng Li, Zeyu Wang, Shu Zhong, Weihao Yu, Xiaonan Nie, Ziang Song, et al. Emerging properties in unified multimodal pretraining.arXiv preprint arXiv:2505.14683, 2025. 2, 3

  9. [9]

    Accurate 3d face reconstruction with weakly-supervised learning: From single image to image set

    Yu Deng, Jiaolong Yang, Sicheng Xu, Dong Chen, Yunde Jia, and Xin Tong. Accurate 3d face reconstruction with weakly-supervised learning: From single image to image set. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2019. 7

  10. [10]

    Scaling recti- fied flow transformers for high-resolution image synthesis

    Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas M ¨uller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling recti- fied flow transformers for high-resolution image synthesis. InForty-first international conference on machine learning,

  11. [11]

    Information bottleneck disentanglement for identity swapping

    Gege Gao, Huaibo Huang, Chaoyou Fu, Zhaoyang Li, and Ran He. Information bottleneck disentanglement for identity swapping. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2021. 7

  12. [12]

    Tokenflow: Consistent diffusion features for consistent video editing.arXiv preprint arXiv:2307.10373, 2023

    Michal Geyer, Omer Bar-Tal, Shai Bagon, and Tali Dekel. Tokenflow: Consistent diffusion features for consistent video editing.arXiv preprint arXiv:2307.10373, 2023. 2, 3

  13. [13]

    Keyframe-guided creative video inpainting

    Yuwei Guo, Ceyuan Yang, Anyi Rao, Chenlin Meng, Omer Bar-Tal, Shuangrui Ding, Maneesh Agrawala, Dahua Lin, and Bo Dai. Keyframe-guided creative video inpainting. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 13009–13020, 2025. 3

  14. [14]

    Face adapter for pre-trained diffusion models with fine-grained id and attribute control.arXiv preprint arXiv:2405.12970, 2024

    Yue Han, Junwei Zhu, Keke He, Xu Chen, Yanhao Ge, Wei Li, Xiangtai Li, Jiangning Zhang, Chengjie Wang, and Yong Liu. Face adapter for pre-trained diffusion models with fine-grained id and attribute control.arXiv preprint arXiv:2405.12970, 2024. 2, 3, 7, 9

  15. [15]

    Facefusion: Industry leading face manipulation platform, 2025

    Ruhs Henry. Facefusion: Industry leading face manipulation platform, 2025. Accessed: 2025-09-23. 4, 5, 6, 7, 2

  16. [16]

    Delta de- noising score

    Amir Hertz, Kfir Aberman, and Daniel Cohen-Or. Delta de- noising score. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 2328–2337, 2023. 2, 3

  17. [17]

    Animate anyone 2: High-fidelity character image animation with environment affordance.arXiv preprint arXiv:2502.06145, 2025

    Li Hu, Guangyuan Wang, Zhen Shen, Xin Gao, Dechao Meng, Lian Zhuo, Peng Zhang, Bang Zhang, and Liefeng Bo. Animate anyone 2: High-fidelity character image animation with environment affordance.arXiv preprint arXiv:2502.06145, 2025. 2, 3

  18. [18]

    GPT-4o System Card

    Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perel- man, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Weli- hinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card. arXiv preprint arXiv:2410.21276, 2024. 3

  19. [19]

    VACE: All-in-One Video Creation and Editing

    Zeyinzi Jiang, Zhen Han, Chaojie Mao, Jingfeng Zhang, Yulin Pan, and Yu Liu. Vace: All-in-one video creation and editing.arXiv preprint arXiv:2503.07598, 2025. 2, 3, 4, 5, 7

  20. [20]

    Direct inversion: Boosting diffusion-based edit- ing with 3 lines of code.arXiv preprint arXiv:2310.01506,

    Xuan Ju, Ailing Zeng, Yuxuan Bian, Shaoteng Liu, and Qiang Xu. Direct inversion: Boosting diffusion-based edit- ing with 3 lines of code.arXiv preprint arXiv:2310.01506,

  21. [21]

    Black Forest Labs, Stephen Batifol, Andreas Blattmann, Frederic Boesel, Saksham Consul, Cyril Diagne, Tim Dock- horn, Jack English, Zion English, Patrick Esser, et al. Flux. 1 kontext: Flow matching for in-context image generation and editing in latent space.arXiv preprint arXiv:2506.15742,

  22. [22]

    Openhumanvid: A large-scale high-quality dataset for enhancing human-centric video generation

    Hui Li, Mingwang Xu, Yun Zhan, Shan Mu, Jiaye Li, Kaihui Cheng, Yuxuan Chen, Tan Chen, Mao Ye, Jingdong Wang, et al. Openhumanvid: A large-scale high-quality dataset for enhancing human-centric video generation. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 7752–7762, 2025. 7

  23. [23]

    Faceshifter: Towards high fidelity and occlusion aware face swapping.arXiv preprint arXiv:1912.13457, 2019

    Lingzhi Li, Jianmin Bao, Hao Yang, Dong Chen, and Fang Wen. Faceshifter: Towards high fidelity and occlusion aware face swapping.arXiv preprint arXiv:1912.13457, 2019. 1, 3

  24. [24]

    Canonswap: High- fidelity and consistent video face swapping via canonical space modulation.arXiv preprint arXiv:2507.02691, 2025

    Xiangyang Luo, Ye Zhu, Yunfei Liu, Lijian Lin, Cong Wan, Zijian Cai, Shao-Lun Huang, and Yu Li. Canonswap: High- fidelity and consistent video face swapping via canonical space modulation.arXiv preprint arXiv:2507.02691, 2025. 1, 3, 7

  25. [25]

    Pexels: Free stock photos and videos, 2025

    Pexels. Pexels: Free stock photos and videos, 2025. Ac- cessed: 2025-11-20. 7

  26. [26]

    Pixabay: Free images and videos, 2025

    Pixabay. Pixabay: Free images and videos, 2025. Accessed: 2025-11-20. 7

  27. [27]

    DreamFusion: Text-to-3D using 2D Diffusion

    Ben Poole, Ajay Jain, Jonathan T Barron, and Ben Milden- hall. Dreamfusion: Text-to-3d using 2d diffusion.arXiv preprint arXiv:2209.14988, 2022. 2, 3

  28. [28]

    Fatezero: Fus- ing attentions for zero-shot text-based video editing

    Chenyang Qi, Xiaodong Cun, Yong Zhang, Chenyang Lei, Xintao Wang, Ying Shan, and Qifeng Chen. Fatezero: Fus- ing attentions for zero-shot text-based video editing. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 15932–15942, 2023. 2, 3

  29. [29]

    Faceforen- sics++: Learning to detect manipulated facial images

    Andreas Rossler, Davide Cozzolino, Luisa Verdoliva, Chris- tian Riess, Justus Thies, and Matthias Nießner. Faceforen- sics++: Learning to detect manipulated facial images. In Proceedings of the IEEE International Conference on Com- puter Vision, 2019. 6

  30. [30]

    Fine- grained head pose estimation without keypoints

    Nataniel Ruiz, Eunji Chong, and James M Rehg. Fine- grained head pose estimation without keypoints. InProceed- ings of the IEEE conference on computer vision and pattern recognition workshops, 2018. 7

  31. [31]

    Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation

    Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman. Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 22500– 22510, 2023. 3

  32. [32]

    Blendface: Re-designing identity encoders for face- swapping

    Kaede Shiohara, Xingchao Yang, and Takafumi Take- tomi. Blendface: Re-designing identity encoders for face- swapping. InProceedings of the IEEE/CVF International Conference on Computer Vision, 2023. 1, 3, 7

  33. [33]

    Videoanydoor: High-fidelity video ob- ject insertion with precise motion control

    Yuanpeng Tu, Hao Luo, Xi Chen, Sihui Ji, Xiang Bai, and Hengshuang Zhao. Videoanydoor: High-fidelity video ob- ject insertion with precise motion control. InProceedings of the Special Interest Group on Computer Graphics and In- teractive Techniques Conference Conference Papers, pages 1–11, 2025. 2, 3

  34. [34]

    Towards Accurate Generative Models of Video: A New Metric & Challenges

    Thomas Unterthiner, Sjoerd Van Steenkiste, Karol Kurach, Raphael Marinier, Marcin Michalski, and Sylvain Gelly. To- wards accurate generative models of video: A new metric & challenges.arXiv preprint arXiv:1812.01717, 2018. 7

  35. [35]

    Cosface: Large margin cosine loss for deep face recognition

    Hao Wang, Yitong Wang, Zheng Zhou, Xing Ji, Dihong Gong, Jingchao Zhou, Zhifeng Li, and Wei Liu. Cosface: Large margin cosine loss for deep face recognition. InPro- ceedings of the IEEE conference on computer vision and pat- tern recognition, 2018. 7

  36. [36]

    Dy- namicface: High-quality and consistent video face swap- ping using composable 3d facial priors.arXiv preprint arXiv:2501.08553, 2025

    Runqi Wang, Sijie Xu, Tianyao He, Yang Chen, Wei Zhu, Dejia Song, Nemo Chen, Xu Tang, and Yao Hu. Dy- namicface: High-quality and consistent video face swap- ping using composable 3d facial priors.arXiv preprint arXiv:2501.08553, 2025. 2, 3, 6

  37. [37]

    Framer: Interactive frame interpolation.arXiv preprint arXiv:2410.18978, 2024

    Wen Wang, Qiuyu Wang, Kecheng Zheng, Hao Ouyang, Zhekai Chen, Biao Gong, Hao Chen, Yujun Shen, and Chun- hua Shen. Framer: Interactive frame interpolation.arXiv preprint arXiv:2410.18978, 2024. 3

  38. [38]

    Qwen-Image Technical Report

    Chenfei Wu, Jiahao Li, Jingren Zhou, Junyang Lin, Kaiyuan Gao, Kun Yan, Sheng-ming Yin, Shuai Bai, Xiao Xu, Yilei Chen, et al. Qwen-image technical report.arXiv preprint arXiv:2508.02324, 2025. 2, 3

  39. [39]

    Vfhq: A high-quality dataset and bench- mark for video face super-resolution

    Liangbin Xie, Xintao Wang, Honglun Zhang, Chao Dong, and Ying Shan. Vfhq: A high-quality dataset and bench- mark for video face super-resolution. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022. 6, 5

  40. [40]

    Lora-composer: Leveraging low-rank adap- tation for multi-concept customization in training-free diffu- sion models.arXiv preprint arXiv:2403.11627, 2024

    Yang Yang, Wen Wang, Liang Peng, Chaotian Song, Yao Chen, Hengjia Li, Xiaolong Yang, Qinglin Lu, Deng Cai, Boxi Wu, et al. Lora-composer: Leveraging low-rank adap- tation for multi-concept customization in training-free diffu- sion models.arXiv preprint arXiv:2403.11627, 2024. 3

  41. [41]

    Any-to-bokeh: One-step video bokeh via multi-plane image guided diffu- sion.arXiv preprint arXiv:2505.21593, 2025

    Yang Yang, Siming Zheng, Jinwei Chen, Boxi Wu, Xiaofei He, Deng Cai, Bo Li, and Peng-Tao Jiang. Any-to-bokeh: One-step video bokeh via multi-plane image guided diffu- sion.arXiv preprint arXiv:2505.21593, 2025. 3

  42. [42]

    Object-aware inver- sion and reassembly for image editing.arXiv preprint arXiv:2310.12149, 2023

    Zhen Yang, Ganggui Ding, Wen Wang, Hao Chen, Bo- han Zhuang, and Chunhua Shen. Object-aware inver- sion and reassembly for image editing.arXiv preprint arXiv:2310.12149, 2023. 2

  43. [43]

    Effec- tive whole-body pose estimation with two-stages distillation

    Zhendong Yang, Ailing Zeng, Chun Yuan, and Yu Li. Effec- tive whole-body pose estimation with two-stages distillation. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 4210–4220, 2023. 5

  44. [44]

    Bisenet: Bilateral segmentation network for real-time semantic segmentation

    Changqian Yu, Jingbo Wang, Chao Peng, Changxin Gao, Gang Yu, and Nong Sang. Bisenet: Bilateral segmentation network for real-time semantic segmentation. InProceed- ings of the European conference on computer vision (ECCV),

  45. [45]

    Celebv-text: A large-scale facial text-video dataset

    Jianhui Yu, Hao Zhu, Liming Jiang, Chen Change Loy, Wei- dong Cai, and Wayne Wu. Celebv-text: A large-scale facial text-video dataset. InProceedings of the IEEE/CVF Confer- ence on Computer Vision and Pattern Recognition, 2023. 6, 5

  46. [46]

    Adding conditional control to text-to-image diffusion models

    Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. In Proceedings of the IEEE/CVF international conference on computer vision, pages 3836–3847, 2023. 5

  47. [47]

    Diception: A generalist diffusion model for visual perceptual tasks, 2025

    Canyu Zhao, Yanlong Sun, Mingyu Liu, Huanyi Zheng, Muzhi Zhu, Zhiyue Zhao, Hao Chen, Tong He, and Chun- hua Shen. Diception: A generalist diffusion model for visual perceptual tasks, 2025. 5

  48. [48]

    3d object manipulation in a single image using generative models.arXiv preprint arXiv:2501.12935, 2025

    Ruisi Zhao, Zechuan Zhang, Zongxin Yang, and Yi Yang. 3d object manipulation in a single image using generative models.arXiv preprint arXiv:2501.12935, 2025. 3

  49. [49]

    Diffswap: High-fidelity and con- trollable face swapping via 3d-aware masked diffusion

    Wenliang Zhao, Yongming Rao, Weikang Shi, Zuyan Liu, Jie Zhou, and Jiwen Lu. Diffswap: High-fidelity and con- trollable face swapping via 3d-aware masked diffusion. In Proceedings of the IEEE/CVF Conference on Computer Vi- sion and Pattern Recognition, 2023. 2, 3, 7

  50. [50]

    Local conditional controlling for text-to-image diffu- sion models

    Yibo Zhao, Liang Peng, Yang Yang, Zekai Luo, Hengjia Li, Yao Chen, Zheng Yang, Xiaofei He, Wei Zhao, Qinglin Lu, et al. Local conditional controlling for text-to-image diffu- sion models. InProceedings of the AAAI Conference on Ar- tificial Intelligence, pages 10492–10500, 2025. 3

  51. [51]

    Unleash- ing the potential of the diffusion model in few-shot semantic segmentation.Advances in Neural Information Processing Systems, 37:42672–42695, 2024

    Muzhi Zhu, Yang Liu, Zekai Luo, Chenchen Jing, Hao Chen, Guangkai Xu, Xinlong Wang, and Chunhua Shen. Unleash- ing the potential of the diffusion model in few-shot semantic segmentation.Advances in Neural Information Processing Systems, 37:42672–42695, 2024. 3 Preserving Source Video Realism: High-Fidelity Face Swapping for Cinematic Quality Supplementary...