Recognition: unknown
Preserving Source Video Realism: High-Fidelity Face Swapping for Cinematic Quality
Pith reviewed 2026-05-16 23:53 UTC · model grok-4.3
The pith
LivingSwap conditions video face swaps on keyframes and source reference to preserve expressions, lighting, and motion across long sequences.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
LivingSwap is the first video reference-guided face swapping model that employs keyframes as conditioning signals to inject the target identity and performs temporal stitching by combining keyframe conditioning with video reference guidance, thereby ensuring stable identity preservation and high-fidelity reconstruction of expressions, lighting, and motion across long video sequences.
What carries the argument
Keyframe conditioning signals combined with video reference guidance for temporal stitching.
If this is right
- Target identity integrates with source expressions, lighting, and motion without manual cleanup
- Temporal coherence holds across extended video sequences
- Production workflows require substantially less manual intervention for face replacement
- State-of-the-art fidelity on reference-guided video face swapping benchmarks
Where Pith is reading between the lines
- The same keyframe-plus-reference pattern could be tested on other reference-guided video tasks such as object insertion or style transfer
- If the reversed-pair construction proves robust, similar dataset reversal might improve supervision for other paired video editing problems
- Deployment on longer films would require checking whether drift accumulates beyond the lengths seen in Face2Face training clips
Load-bearing premise
The Face2Face dataset and its reversed pairs supply unbiased ground-truth supervision that generalizes to arbitrary long, complex cinematic videos without artifacts or identity leakage.
What would settle it
Visible identity leakage, loss of source lighting consistency, or temporal artifacts in a long real-world cinematic sequence whose motion and lighting patterns fall outside the distribution of the reversed Face2Face pairs.
Figures
read the original abstract
Video face swapping is crucial in film and entertainment production, where achieving high fidelity and temporal consistency over long and complex video sequences remains a significant challenge. Inspired by recent advances in reference-guided image editing, we explore whether rich visual attributes from source videos can be similarly leveraged to enhance both fidelity and temporal coherence in video face swapping. Building on this insight, this work presents LivingSwap, the first video reference guided face swapping model. Our approach employs keyframes as conditioning signals to inject the target identity, enabling flexible and controllable editing. By combining keyframe conditioning with video reference guidance, the model performs temporal stitching to ensure stable identity preservation and high-fidelity reconstruction across long video sequences. To address the scarcity of data for reference-guided training, we construct a paired face-swapping dataset, Face2Face, and further reverse the data pairs to ensure reliable ground-truth supervision. Extensive experiments demonstrate that our method achieves state-of-the-art results, seamlessly integrating the target identity with the source video's expressions, lighting, and motion, while significantly reducing manual effort in production workflows. Project webpage: https://aim-uofa.github.io/LivingSwap
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces LivingSwap, the first video reference-guided face swapping model. It uses keyframes as conditioning signals to inject target identity, combines this with video reference guidance, and applies temporal stitching to maintain identity preservation and high-fidelity reconstruction across long video sequences. To enable training, the authors construct a paired Face2Face dataset and reverse the data pairs for ground-truth supervision. The central claim is that the method achieves state-of-the-art results by seamlessly integrating target identity with source expressions, lighting, and motion while reducing manual effort in cinematic production workflows.
Significance. If the performance claims hold under rigorous evaluation, the work could advance practical video face swapping for film and entertainment by improving temporal consistency and realism in complex sequences. The reference-guided approach adapted from image editing to video is a reasonable direction for controllable editing. However, the current lack of quantitative metrics, baselines, and generalization tests makes it difficult to determine whether the contribution meaningfully exceeds existing methods.
major comments (3)
- [Abstract] Abstract: The state-of-the-art claim is unsupported by any quantitative metrics, baseline comparisons, ablation studies, or error analysis, which are required to substantiate performance assertions in a computer vision manuscript on face swapping.
- [Dataset Construction] Dataset Construction: Reversing pairs in the newly constructed Face2Face dataset to create supervision may embed source-specific correlations in lighting, motion statistics, and identity cues rather than producing clean target swaps; this risks overfitting in the keyframe conditioning and temporal stitching modules and undermines generalization claims to arbitrary long cinematic sequences.
- [Method] Method: The keyframe selection and conditioning strength are listed as free parameters without a clear protocol for their determination or robustness analysis across diverse video lengths and complexities, leaving the central controllability claim load-bearing but unverified.
minor comments (2)
- [Abstract] Abstract: The project webpage is referenced but no information on code or model availability is provided, which would aid reproducibility.
- [Introduction] The manuscript would benefit from additional citations to recent video face swapping and temporal consistency methods to better position the novelty of the reference-guided temporal stitching.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address each major comment point-by-point below, indicating where revisions will be made to strengthen the manuscript.
read point-by-point responses
-
Referee: [Abstract] Abstract: The state-of-the-art claim is unsupported by any quantitative metrics, baseline comparisons, ablation studies, or error analysis, which are required to substantiate performance assertions in a computer vision manuscript on face swapping.
Authors: We acknowledge that the SOTA claim requires stronger quantitative backing. The current manuscript prioritizes qualitative visual results and temporal consistency demonstrations across long sequences, but we agree this is insufficient. In the revised version we will add quantitative metrics (e.g., FID, LPIPS, and a temporal consistency score), direct comparisons against recent baselines, and ablation studies on the keyframe conditioning and temporal stitching modules. revision: yes
-
Referee: [Dataset Construction] Dataset Construction: Reversing pairs in the newly constructed Face2Face dataset to create supervision may embed source-specific correlations in lighting, motion statistics, and identity cues rather than producing clean target swaps; this risks overfitting in the keyframe conditioning and temporal stitching modules and undermines generalization claims to arbitrary long cinematic sequences.
Authors: The reversal creates paired supervision by swapping the roles of source and target so the model learns identity injection while preserving the original source attributes; the Face2Face dataset was built from diverse cinematic clips to reduce spurious correlations. We accept that additional validation is warranted and will add generalization tests on out-of-distribution long sequences in the revision. revision: partial
-
Referee: [Method] Method: The keyframe selection and conditioning strength are listed as free parameters without a clear protocol for their determination or robustness analysis across diverse video lengths and complexities, leaving the central controllability claim load-bearing but unverified.
Authors: Keyframe selection uses a motion-threshold protocol (detailed in Section 3.2) and conditioning strength is set via validation-set tuning. We will expand the method description with an explicit protocol and add a robustness subsection reporting performance across video lengths and parameter ranges to verify controllability. revision: yes
Circularity Check
No significant circularity in derivation chain
full rationale
The paper presents an empirical ML method for video face swapping that constructs the Face2Face dataset and reverses pairs to create supervision signals. No mathematical derivations, equations, or first-principles results are described that reduce by construction to fitted parameters or self-referential inputs. Central claims of SOTA performance rest on training the model and evaluating against external baselines rather than any self-definition, fitted-input-as-prediction, or self-citation load-bearing step. Dataset construction addresses data scarcity in a standard non-circular manner and does not force the reported outcomes.
Axiom & Free-Parameter Ledger
free parameters (1)
- keyframe selection and conditioning strength
axioms (1)
- domain assumption Rich visual attributes from source videos can be leveraged via keyframe conditioning to enhance fidelity and temporal coherence in face swapping.
Reference graph
Works this paper leans on
-
[1]
L2cs-net: Fine-grained gaze estimation in unconstrained environments
Ahmed A Abdelrahman, Thorsten Hempel, Aly Khalifa, Ay- oub Al-Hamadi, and Laslo Dinges. L2cs-net: Fine-grained gaze estimation in unconstrained environments. In2023 8th International Conference on Frontiers of Signal Processing (ICFSP), pages 98–102. IEEE, 2023. 7
work page 2023
-
[2]
Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets
Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram V oleti, Adam Letts, et al. Stable video diffusion: Scaling latent video diffusion models to large datasets.arXiv preprint arXiv:2311.15127, 2023. 7
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[3]
Masactrl: Tuning-free mu- tual self-attention control for consistent image synthesis and editing
Mingdeng Cao, Xintao Wang, Zhongang Qi, Ying Shan, Xi- aohu Qie, and Yinqiang Zheng. Masactrl: Tuning-free mu- tual self-attention control for consistent image synthesis and editing. InProceedings of the IEEE/CVF international con- ference on computer vision, pages 22560–22570, 2023. 3
work page 2023
-
[4]
Simswap: An efficient framework for high fidelity face swapping
Renwang Chen, Xuanhong Chen, Bingbing Ni, and Yanhao Ge. Simswap: An efficient framework for high fidelity face swapping. InProceedings of the 28th ACM International Conference on Multimedia, 2020. 1, 3, 6, 7
work page 2020
-
[5]
Hifivfs: High fidelity video face swapping
Xu Chen, Keke He, Junwei Zhu, Yanhao Ge, Wei Li, and Chengjie Wang. Hifivfs: High fidelity video face swapping. arXiv preprint arXiv:2411.18293, 2024. 2, 3, 6, 7, 9
-
[6]
Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blis- tein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities.arXiv preprint arXiv:2507.06261, 2025. 3
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[7]
https://github.com/deepfakes/face- swap.Ac- cessed: 2020-12-20, 2020
DeepFakes. https://github.com/deepfakes/face- swap.Ac- cessed: 2020-12-20, 2020. 1, 3
work page 2020
-
[8]
Emerging Properties in Unified Multimodal Pretraining
Chaorui Deng, Deyao Zhu, Kunchang Li, Chenhui Gou, Feng Li, Zeyu Wang, Shu Zhong, Weihao Yu, Xiaonan Nie, Ziang Song, et al. Emerging properties in unified multimodal pretraining.arXiv preprint arXiv:2505.14683, 2025. 2, 3
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[9]
Accurate 3d face reconstruction with weakly-supervised learning: From single image to image set
Yu Deng, Jiaolong Yang, Sicheng Xu, Dong Chen, Yunde Jia, and Xin Tong. Accurate 3d face reconstruction with weakly-supervised learning: From single image to image set. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2019. 7
work page 2019
-
[10]
Scaling recti- fied flow transformers for high-resolution image synthesis
Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas M ¨uller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling recti- fied flow transformers for high-resolution image synthesis. InForty-first international conference on machine learning,
-
[11]
Information bottleneck disentanglement for identity swapping
Gege Gao, Huaibo Huang, Chaoyou Fu, Zhaoyang Li, and Ran He. Information bottleneck disentanglement for identity swapping. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2021. 7
work page 2021
-
[12]
Michal Geyer, Omer Bar-Tal, Shai Bagon, and Tali Dekel. Tokenflow: Consistent diffusion features for consistent video editing.arXiv preprint arXiv:2307.10373, 2023. 2, 3
-
[13]
Keyframe-guided creative video inpainting
Yuwei Guo, Ceyuan Yang, Anyi Rao, Chenlin Meng, Omer Bar-Tal, Shuangrui Ding, Maneesh Agrawala, Dahua Lin, and Bo Dai. Keyframe-guided creative video inpainting. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 13009–13020, 2025. 3
work page 2025
-
[14]
Yue Han, Junwei Zhu, Keke He, Xu Chen, Yanhao Ge, Wei Li, Xiangtai Li, Jiangning Zhang, Chengjie Wang, and Yong Liu. Face adapter for pre-trained diffusion models with fine-grained id and attribute control.arXiv preprint arXiv:2405.12970, 2024. 2, 3, 7, 9
-
[15]
Facefusion: Industry leading face manipulation platform, 2025
Ruhs Henry. Facefusion: Industry leading face manipulation platform, 2025. Accessed: 2025-09-23. 4, 5, 6, 7, 2
work page 2025
-
[16]
Amir Hertz, Kfir Aberman, and Daniel Cohen-Or. Delta de- noising score. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 2328–2337, 2023. 2, 3
work page 2023
-
[17]
Li Hu, Guangyuan Wang, Zhen Shen, Xin Gao, Dechao Meng, Lian Zhuo, Peng Zhang, Bang Zhang, and Liefeng Bo. Animate anyone 2: High-fidelity character image animation with environment affordance.arXiv preprint arXiv:2502.06145, 2025. 2, 3
-
[18]
Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perel- man, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Weli- hinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card. arXiv preprint arXiv:2410.21276, 2024. 3
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[19]
VACE: All-in-One Video Creation and Editing
Zeyinzi Jiang, Zhen Han, Chaojie Mao, Jingfeng Zhang, Yulin Pan, and Yu Liu. Vace: All-in-one video creation and editing.arXiv preprint arXiv:2503.07598, 2025. 2, 3, 4, 5, 7
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[20]
Xuan Ju, Ailing Zeng, Yuxuan Bian, Shaoteng Liu, and Qiang Xu. Direct inversion: Boosting diffusion-based edit- ing with 3 lines of code.arXiv preprint arXiv:2310.01506,
-
[21]
Black Forest Labs, Stephen Batifol, Andreas Blattmann, Frederic Boesel, Saksham Consul, Cyril Diagne, Tim Dock- horn, Jack English, Zion English, Patrick Esser, et al. Flux. 1 kontext: Flow matching for in-context image generation and editing in latent space.arXiv preprint arXiv:2506.15742,
work page internal anchor Pith review Pith/arXiv arXiv
-
[22]
Openhumanvid: A large-scale high-quality dataset for enhancing human-centric video generation
Hui Li, Mingwang Xu, Yun Zhan, Shan Mu, Jiaye Li, Kaihui Cheng, Yuxuan Chen, Tan Chen, Mao Ye, Jingdong Wang, et al. Openhumanvid: A large-scale high-quality dataset for enhancing human-centric video generation. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 7752–7762, 2025. 7
work page 2025
-
[23]
Lingzhi Li, Jianmin Bao, Hao Yang, Dong Chen, and Fang Wen. Faceshifter: Towards high fidelity and occlusion aware face swapping.arXiv preprint arXiv:1912.13457, 2019. 1, 3
-
[24]
Xiangyang Luo, Ye Zhu, Yunfei Liu, Lijian Lin, Cong Wan, Zijian Cai, Shao-Lun Huang, and Yu Li. Canonswap: High- fidelity and consistent video face swapping via canonical space modulation.arXiv preprint arXiv:2507.02691, 2025. 1, 3, 7
-
[25]
Pexels: Free stock photos and videos, 2025
Pexels. Pexels: Free stock photos and videos, 2025. Ac- cessed: 2025-11-20. 7
work page 2025
-
[26]
Pixabay: Free images and videos, 2025
Pixabay. Pixabay: Free images and videos, 2025. Accessed: 2025-11-20. 7
work page 2025
-
[27]
DreamFusion: Text-to-3D using 2D Diffusion
Ben Poole, Ajay Jain, Jonathan T Barron, and Ben Milden- hall. Dreamfusion: Text-to-3d using 2d diffusion.arXiv preprint arXiv:2209.14988, 2022. 2, 3
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[28]
Fatezero: Fus- ing attentions for zero-shot text-based video editing
Chenyang Qi, Xiaodong Cun, Yong Zhang, Chenyang Lei, Xintao Wang, Ying Shan, and Qifeng Chen. Fatezero: Fus- ing attentions for zero-shot text-based video editing. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 15932–15942, 2023. 2, 3
work page 2023
-
[29]
Faceforen- sics++: Learning to detect manipulated facial images
Andreas Rossler, Davide Cozzolino, Luisa Verdoliva, Chris- tian Riess, Justus Thies, and Matthias Nießner. Faceforen- sics++: Learning to detect manipulated facial images. In Proceedings of the IEEE International Conference on Com- puter Vision, 2019. 6
work page 2019
-
[30]
Fine- grained head pose estimation without keypoints
Nataniel Ruiz, Eunji Chong, and James M Rehg. Fine- grained head pose estimation without keypoints. InProceed- ings of the IEEE conference on computer vision and pattern recognition workshops, 2018. 7
work page 2018
-
[31]
Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation
Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman. Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 22500– 22510, 2023. 3
work page 2023
-
[32]
Blendface: Re-designing identity encoders for face- swapping
Kaede Shiohara, Xingchao Yang, and Takafumi Take- tomi. Blendface: Re-designing identity encoders for face- swapping. InProceedings of the IEEE/CVF International Conference on Computer Vision, 2023. 1, 3, 7
work page 2023
-
[33]
Videoanydoor: High-fidelity video ob- ject insertion with precise motion control
Yuanpeng Tu, Hao Luo, Xi Chen, Sihui Ji, Xiang Bai, and Hengshuang Zhao. Videoanydoor: High-fidelity video ob- ject insertion with precise motion control. InProceedings of the Special Interest Group on Computer Graphics and In- teractive Techniques Conference Conference Papers, pages 1–11, 2025. 2, 3
work page 2025
-
[34]
Towards Accurate Generative Models of Video: A New Metric & Challenges
Thomas Unterthiner, Sjoerd Van Steenkiste, Karol Kurach, Raphael Marinier, Marcin Michalski, and Sylvain Gelly. To- wards accurate generative models of video: A new metric & challenges.arXiv preprint arXiv:1812.01717, 2018. 7
work page internal anchor Pith review Pith/arXiv arXiv 2018
-
[35]
Cosface: Large margin cosine loss for deep face recognition
Hao Wang, Yitong Wang, Zheng Zhou, Xing Ji, Dihong Gong, Jingchao Zhou, Zhifeng Li, and Wei Liu. Cosface: Large margin cosine loss for deep face recognition. InPro- ceedings of the IEEE conference on computer vision and pat- tern recognition, 2018. 7
work page 2018
-
[36]
Runqi Wang, Sijie Xu, Tianyao He, Yang Chen, Wei Zhu, Dejia Song, Nemo Chen, Xu Tang, and Yao Hu. Dy- namicface: High-quality and consistent video face swap- ping using composable 3d facial priors.arXiv preprint arXiv:2501.08553, 2025. 2, 3, 6
-
[37]
Framer: Interactive frame interpolation.arXiv preprint arXiv:2410.18978, 2024
Wen Wang, Qiuyu Wang, Kecheng Zheng, Hao Ouyang, Zhekai Chen, Biao Gong, Hao Chen, Yujun Shen, and Chun- hua Shen. Framer: Interactive frame interpolation.arXiv preprint arXiv:2410.18978, 2024. 3
-
[38]
Chenfei Wu, Jiahao Li, Jingren Zhou, Junyang Lin, Kaiyuan Gao, Kun Yan, Sheng-ming Yin, Shuai Bai, Xiao Xu, Yilei Chen, et al. Qwen-image technical report.arXiv preprint arXiv:2508.02324, 2025. 2, 3
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[39]
Vfhq: A high-quality dataset and bench- mark for video face super-resolution
Liangbin Xie, Xintao Wang, Honglun Zhang, Chao Dong, and Ying Shan. Vfhq: A high-quality dataset and bench- mark for video face super-resolution. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022. 6, 5
work page 2022
-
[40]
Yang Yang, Wen Wang, Liang Peng, Chaotian Song, Yao Chen, Hengjia Li, Xiaolong Yang, Qinglin Lu, Deng Cai, Boxi Wu, et al. Lora-composer: Leveraging low-rank adap- tation for multi-concept customization in training-free diffu- sion models.arXiv preprint arXiv:2403.11627, 2024. 3
-
[41]
Yang Yang, Siming Zheng, Jinwei Chen, Boxi Wu, Xiaofei He, Deng Cai, Bo Li, and Peng-Tao Jiang. Any-to-bokeh: One-step video bokeh via multi-plane image guided diffu- sion.arXiv preprint arXiv:2505.21593, 2025. 3
-
[42]
Object-aware inver- sion and reassembly for image editing.arXiv preprint arXiv:2310.12149, 2023
Zhen Yang, Ganggui Ding, Wen Wang, Hao Chen, Bo- han Zhuang, and Chunhua Shen. Object-aware inver- sion and reassembly for image editing.arXiv preprint arXiv:2310.12149, 2023. 2
-
[43]
Effec- tive whole-body pose estimation with two-stages distillation
Zhendong Yang, Ailing Zeng, Chun Yuan, and Yu Li. Effec- tive whole-body pose estimation with two-stages distillation. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 4210–4220, 2023. 5
work page 2023
-
[44]
Bisenet: Bilateral segmentation network for real-time semantic segmentation
Changqian Yu, Jingbo Wang, Chao Peng, Changxin Gao, Gang Yu, and Nong Sang. Bisenet: Bilateral segmentation network for real-time semantic segmentation. InProceed- ings of the European conference on computer vision (ECCV),
-
[45]
Celebv-text: A large-scale facial text-video dataset
Jianhui Yu, Hao Zhu, Liming Jiang, Chen Change Loy, Wei- dong Cai, and Wayne Wu. Celebv-text: A large-scale facial text-video dataset. InProceedings of the IEEE/CVF Confer- ence on Computer Vision and Pattern Recognition, 2023. 6, 5
work page 2023
-
[46]
Adding conditional control to text-to-image diffusion models
Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. In Proceedings of the IEEE/CVF international conference on computer vision, pages 3836–3847, 2023. 5
work page 2023
-
[47]
Diception: A generalist diffusion model for visual perceptual tasks, 2025
Canyu Zhao, Yanlong Sun, Mingyu Liu, Huanyi Zheng, Muzhi Zhu, Zhiyue Zhao, Hao Chen, Tong He, and Chun- hua Shen. Diception: A generalist diffusion model for visual perceptual tasks, 2025. 5
work page 2025
-
[48]
Ruisi Zhao, Zechuan Zhang, Zongxin Yang, and Yi Yang. 3d object manipulation in a single image using generative models.arXiv preprint arXiv:2501.12935, 2025. 3
-
[49]
Diffswap: High-fidelity and con- trollable face swapping via 3d-aware masked diffusion
Wenliang Zhao, Yongming Rao, Weikang Shi, Zuyan Liu, Jie Zhou, and Jiwen Lu. Diffswap: High-fidelity and con- trollable face swapping via 3d-aware masked diffusion. In Proceedings of the IEEE/CVF Conference on Computer Vi- sion and Pattern Recognition, 2023. 2, 3, 7
work page 2023
-
[50]
Local conditional controlling for text-to-image diffu- sion models
Yibo Zhao, Liang Peng, Yang Yang, Zekai Luo, Hengjia Li, Yao Chen, Zheng Yang, Xiaofei He, Wei Zhao, Qinglin Lu, et al. Local conditional controlling for text-to-image diffu- sion models. InProceedings of the AAAI Conference on Ar- tificial Intelligence, pages 10492–10500, 2025. 3
work page 2025
-
[51]
Muzhi Zhu, Yang Liu, Zekai Luo, Chenchen Jing, Hao Chen, Guangkai Xu, Xinlong Wang, and Chunhua Shen. Unleash- ing the potential of the diffusion model in few-shot semantic segmentation.Advances in Neural Information Processing Systems, 37:42672–42695, 2024. 3 Preserving Source Video Realism: High-Fidelity Face Swapping for Cinematic Quality Supplementary...
work page 2024
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.