Unmasking Puppeteers: Leveraging Biometric Leakage to Expose Impersonation in AI-Based Videoconferencing
Pith reviewed 2026-05-18 09:53 UTC · model grok-4.3
The pith
The pose-expression latent in AI talking-head videoconferencing carries persistent biometric identity information that can be isolated to detect impersonation without inspecting the output video.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We observe that the pose-expression latent inherently contains biometric information of the driving identity. We introduce a pose-conditioned, large-margin contrastive encoder that isolates persistent identity cues inside the transmitted latent while cancelling transient pose and expression. A simple cosine test on this disentangled embedding flags illicit identity swaps as the video is rendered.
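The cosine test described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names, the toy embeddings, and the 0.5 threshold are all assumptions, and this summary does not say how the reference embedding for the expected speaker is obtained.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def is_puppeteered(reference_embedding, driving_embedding, threshold=0.5):
    """Flag a frame when the driving embedding is too far from the reference.

    The threshold is a hypothetical value; in practice it would be tuned
    on validation data.
    """
    return cosine_similarity(reference_embedding, driving_embedding) < threshold


# Toy identity embeddings, made up for illustration only.
enrolled = [0.9, 0.1, 0.4]
same_speaker = [0.8, 0.2, 0.5]
other_speaker = [-0.3, 0.9, 0.1]

print(is_puppeteered(enrolled, same_speaker))
print(is_puppeteered(enrolled, other_speaker))
```

In a live pipeline the check would presumably run once per received latent, which is why a single dot product and two norms is attractive for latency.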
What carries the argument
A pose-conditioned, large-margin contrastive encoder that isolates persistent identity cues inside the transmitted pose-expression latent while suppressing transient pose and expression variations.
If this is right
- Enables real-time detection of puppeteering attacks directly on the transmitted latent.
- Outperforms prior puppeteering defenses across multiple talking-head generation models.
- Generalizes to out-of-distribution driving identities and poses without retraining.
- Operates without ever accessing or reconstructing the final RGB video frames.
- Maintains low latency suitable for live videoconferencing pipelines.
Where Pith is reading between the lines
- The same leakage principle may apply to other latent-based generative systems that transmit intermediate representations instead of full frames.
- This approach could be combined with encryption of the latent stream to add a lightweight identity verification layer to existing video codecs.
- Auditing generative latents for hidden attributes without reconstructing outputs may become a general technique for securing other synthesis pipelines.
- If the contrastive training succeeds without RGB supervision, similar methods might extract other persistent attributes such as age or ethnicity from driving signals.
Load-bearing premise
Biometric identity information persists in the pose-expression latent independently of transient pose and expression variations and can be reliably isolated by a pose-conditioned contrastive encoder trained without direct access to reconstructed RGB frames.
What would settle it
An experiment that drives the same pose sequence with two different identities, extracts the identity embeddings, and checks whether their cosine similarity remains low; if the embeddings fail to separate reliably, the detection method collapses.
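One way to score the matched-pose experiment described above is sketched here. It assumes the identity embeddings have already been extracted, one per frame, and that frame t of both sequences shares the same pose. The function name and the toy vectors are hypothetical and stand in for real encoder outputs.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def matched_pose_separation(seq_a, seq_b):
    """Compare same-identity and cross-identity similarity.

    seq_a and seq_b hold one embedding per frame for identities A and B,
    with frame t of both sequences driven by the same pose.
    Returns (mean similarity between different frames of A,
             mean similarity between A and B at matched frames).
    """
    if len(seq_a) != len(seq_b) or len(seq_a) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    same = [
        cosine_similarity(seq_a[i], seq_a[j])
        for i in range(len(seq_a))
        for j in range(i + 1, len(seq_a))
    ]
    cross = [cosine_similarity(a, b) for a, b in zip(seq_a, seq_b)]
    return sum(same) / len(same), sum(cross) / len(cross)


# Toy embeddings, made up for illustration only.
identity_a = [[1.0, 0.1], [0.9, 0.2], [1.0, 0.0]]
identity_b = [[0.1, 1.0], [0.2, 0.9], [0.0, 1.0]]

same_mean, cross_mean = matched_pose_separation(identity_a, identity_b)
print(same_mean > cross_mean)
```

If the cross-identity mean is not clearly below the same-identity mean on real data, the premise that identity survives in the latent independently of pose would be in doubt.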
Original abstract
AI-based talking-head videoconferencing systems reduce bandwidth by sending a compact pose-expression latent and re-synthesizing RGB at the receiver, but this latent can be puppeteered, letting an attacker hijack a victim's likeness in real time. Because every frame is synthetic, deepfake and synthetic video detectors fail outright. To address this security problem, we exploit a key observation: the pose-expression latent inherently contains biometric information of the driving identity. Therefore, we introduce the first biometric leakage defense without ever looking at the reconstructed RGB video: a pose-conditioned, large-margin contrastive encoder that isolates persistent identity cues inside the transmitted latent while cancelling transient pose and expression. A simple cosine test on this disentangled embedding flags illicit identity swaps as the video is rendered. Our experiments on multiple talking-head generation models show that our method consistently outperforms existing puppeteering defenses, operates in real-time, and shows strong generalization to out-of-distribution scenarios.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that AI-based talking-head videoconferencing systems are vulnerable to real-time puppeteering attacks because the transmitted pose-expression latent can be manipulated to hijack a victim's likeness. It observes that this latent inherently leaks biometric information of the driving identity and introduces a pose-conditioned large-margin contrastive encoder to isolate persistent identity cues while cancelling transient pose and expression variations. Detection of illicit swaps is performed via a cosine similarity test on the resulting embedding, without ever accessing the reconstructed RGB video. Experiments are said to demonstrate consistent outperformance over existing defenses, real-time operation, and strong generalization to out-of-distribution cases across multiple talking-head models.
Significance. If the central claim holds, the work would be significant for addressing a practical security gap in real-time video communications where standard deepfake detectors fail on fully synthetic outputs. The direct use of the transmitted latent for biometric isolation, without RGB reconstruction, offers an efficient and deployable defense. Credit is given for the focus on real-time applicability and reported OOD generalization, which could inform future protocol designs if supported by rigorous validation.
major comments (3)
- [Abstract] Abstract: The claim of 'consistent outperformance on multiple talking-head models and generalization to out-of-distribution cases' is presented without any quantitative metrics, baselines, or experimental controls. This is load-bearing for the central claim, as it prevents assessment of whether the contrastive encoder reliably supports detection.
- [Method] Method section (encoder description): The pose-conditioned large-margin contrastive encoder is asserted to isolate persistent identity while cancelling transient pose/expression, but no explicit invariance verification (e.g., embedding variance under controlled pose changes for fixed identity) or ablation on the conditioning is provided. Without this, detected differences in cosine tests could arise from pose mismatch rather than identity swap, undermining the defense.
- [Experiments] Experiments section: The manuscript reports outperformance and real-time operation but omits specifics on training details, loss formulation, dataset splits, or numerical results (e.g., detection rates, false positives). This leaves the soundness of the biometric leakage exploitation unverified.
minor comments (2)
- [Method] Clarify the precise integration of pose conditioning into the encoder (e.g., via concatenation or modulation) and the margin value in the contrastive loss for reproducibility.
- [Related Work] Add a reference to prior work on latent-space identity leakage in generative models to better contextualize novelty.
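For orientation, the two conditioning options named in the first minor comment, together with one widely used large-margin cosine objective, can be written as follows. This is an illustrative sketch rather than the paper's formulation, which is not given in this summary. All symbols are assumptions introduced here: $z$ is the transmitted latent, $p$ a pose code, $h$ an intermediate feature map, $\gamma$ and $\beta$ learned modulation functions, $\theta_j$ the angle between the embedding and the class-$j$ prototype, $s$ a scale, and $m$ the margin.

```latex
% Conditioning by concatenation vs. feature-wise modulation (assumed notation)
e_{\mathrm{concat}} = f_\theta\big([\,z \,;\, p\,]\big),
\qquad
e_{\mathrm{mod}} = g_\theta\big(\gamma(p) \odot h(z) + \beta(p)\big)

% Large-margin cosine loss for sample i with identity label y_i
\mathcal{L}_i = -\log
\frac{\exp\!\big(s\,(\cos\theta_{y_i} - m)\big)}
     {\exp\!\big(s\,(\cos\theta_{y_i} - m)\big)
      + \sum_{j \neq y_i} \exp\!\big(s\,\cos\theta_{j}\big)}
```

Reporting which conditioning form is used, and the values of $s$ and $m$, would be enough to make this part of the method reproducible.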
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed review. The comments highlight important areas where additional clarity and evidence will strengthen the manuscript. We address each major comment below and commit to revisions that directly respond to the concerns raised.
Point-by-point responses
- Referee: [Abstract] Abstract: The claim of 'consistent outperformance on multiple talking-head models and generalization to out-of-distribution cases' is presented without any quantitative metrics, baselines, or experimental controls. This is load-bearing for the central claim, as it prevents assessment of whether the contrastive encoder reliably supports detection.
  Authors: We agree that the abstract would be more informative with key quantitative support for the performance claims. In the revised manuscript we will add concise numerical highlights drawn from the experiments (e.g., detection accuracy, FPR, and relative gains versus baselines) while preserving the abstract's brevity. This change will allow readers to evaluate the central claim without first consulting the full experimental section.
  Revision: yes
- Referee: [Method] Method section (encoder description): The pose-conditioned large-margin contrastive encoder is asserted to isolate persistent identity while cancelling transient pose/expression, but no explicit invariance verification (e.g., embedding variance under controlled pose changes for fixed identity) or ablation on the conditioning is provided. Without this, detected differences in cosine tests could arise from pose mismatch rather than identity swap, undermining the defense.
  Authors: We acknowledge the value of explicit verification. The current manuscript describes the conditioning mechanism and loss but does not include a dedicated invariance study or conditioning ablation. We will add both: (1) quantitative embedding variance measurements for fixed identities across controlled pose/expression variations, and (2) an ablation removing the pose-conditioning input. These additions will directly demonstrate that identity separation is not driven by residual pose mismatch.
  Revision: yes
- Referee: [Experiments] Experiments section: The manuscript reports outperformance and real-time operation but omits specifics on training details, loss formulation, dataset splits, or numerical results (e.g., detection rates, false positives). This leaves the soundness of the biometric leakage exploitation unverified.
  Authors: We agree that the experimental section requires greater specificity for reproducibility and verification. The revision will expand this section to include the precise training hyperparameters, the full contrastive loss formulation, dataset split details, and complete numerical results (detection rates, false-positive rates, latency measurements, and per-model comparisons). We will also add explicit controls and metrics for the out-of-distribution generalization experiments.
  Revision: yes
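The detection rates and false-positive rates requested above follow directly from per-call cosine scores once a threshold is fixed. The sketch below shows that bookkeeping under stated assumptions: the function name, the scores, and the 0.5 threshold are hypothetical and do not come from the paper.

```python
def detection_metrics(genuine_scores, attack_scores, threshold):
    """Return (detection_rate, false_positive_rate) for a cosine-score threshold.

    A call is flagged as puppeteered when its score falls below the threshold.
    genuine_scores come from legitimate calls, attack_scores from identity swaps.
    """
    if not genuine_scores or not attack_scores:
        raise ValueError("both score lists must be non-empty")
    detected = sum(1 for s in attack_scores if s < threshold)
    false_alarms = sum(1 for s in genuine_scores if s < threshold)
    return detected / len(attack_scores), false_alarms / len(genuine_scores)


# Toy scores, made up for illustration only.
genuine = [0.92, 0.88, 0.95, 0.45, 0.90]
attacks = [0.10, 0.35, 0.62, 0.20]

detection_rate, false_positive_rate = detection_metrics(genuine, attacks, threshold=0.5)
print(detection_rate, false_positive_rate)
```

Sweeping the threshold over the observed scores would trace out the trade-off curve that per-model comparisons are usually reported on.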
Circularity Check
No significant circularity; central defense relies on trained encoder from empirical observation
The paper's derivation begins from the stated observation that the pose-expression latent contains biometric identity information and proceeds by introducing a new pose-conditioned large-margin contrastive encoder trained to isolate persistent cues while cancelling transients. This is not an algebraic reduction, fitted-input prediction, or self-citation chain that collapses back to the inputs by construction; the encoder is presented as an independently trained component whose parameters are learned rather than predefined from the target result. No equations or prior-author uniqueness theorems are invoked in the provided description to force the outcome, and the method is self-contained as a machine-learned detector evaluated on multiple models.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: The pose-expression latent inherently contains biometric information of the driving identity that persists independently of pose and expression.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel
  Relation between the paper passage and the cited Recognition theorem: unclear
  Passage: pose-conditioned, large-margin contrastive encoder that isolates persistent identity cues inside the transmitted latent while cancelling transient pose and expression
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.