arxiv: 2510.03548 · v4 · submitted 2025-10-03 · 💻 cs.CV · cs.AI

Unmasking Puppeteers: Leveraging Biometric Leakage to Expose Impersonation in AI-Based Videoconferencing

Pith reviewed 2026-05-18 09:53 UTC · model grok-4.3

keywords: biometric leakage, puppeteering defense, talking-head synthesis, contrastive encoder, identity detection, latent space security, videoconferencing, AI video defense

The pith

The pose-expression latent in AI talking-head videoconferencing carries persistent biometric identity information that can be isolated to detect impersonation without inspecting the output video.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

AI videoconferencing systems transmit a compact latent that is meant to encode only pose and expression, and the receiver reconstructs the video from it. An attacker can puppeteer that latent and drive the victim's face in real time. Standard deepfake detectors fail here because every rendered frame is synthetic, whether or not an attack is under way. The paper establishes that the latent still retains stable biometric cues about the person actually driving it. It introduces a pose-conditioned contrastive encoder trained to extract those identity cues while discarding pose and expression changes. A cosine similarity check on the resulting embedding then flags when the driver identity has been swapped.

Core claim

We observe that the pose-expression latent inherently contains biometric information of the driving identity. We introduce a pose-conditioned, large-margin contrastive encoder that isolates persistent identity cues inside the transmitted latent while cancelling transient pose and expression. A simple cosine test on this disentangled embedding flags illicit identity swaps as the video is rendered.
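
The cosine test described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the names `is_puppeteered`, `enrolled`, and the `threshold` value are hypothetical, the vectors are toy numbers, and the sketch assumes the receiver holds a reference identity embedding for the person it expects to be driving the call.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def is_puppeteered(driver_embedding, reference_embedding, threshold=0.5):
    """Flag a swap when the driver embedding is too dissimilar from the reference.

    The threshold is a hypothetical operating point, not a value from the paper.
    """
    return cosine_similarity(driver_embedding, reference_embedding) < threshold


# Toy identity embeddings, for illustration only.
enrolled = [0.9, 0.1, 0.0]      # reference for the expected speaker (assumed available)
same_driver = [0.8, 0.2, 0.1]   # embedding extracted from a legitimate latent
other_driver = [0.0, 0.2, 0.9]  # embedding extracted from an attacker's latent

print(is_puppeteered(same_driver, enrolled))   # False
print(is_puppeteered(other_driver, enrolled))  # True
```

In a live system the threshold would be chosen on held-out data to trade false alarms against missed attacks.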

What carries the argument

A pose-conditioned, large-margin contrastive encoder that isolates persistent identity cues inside the transmitted pose-expression latent while suppressing transient pose and expression variations.
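
The page does not give the paper's loss, so the sketch below swaps in a generic triplet margin loss on cosine similarity to show what "large-margin contrastive" training typically means. The function name `triplet_margin_loss`, the `margin` value, and the toy vectors are all assumptions. The positive is taken to be the same identity under a different pose, and the negative a different identity under a similar pose.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def triplet_margin_loss(anchor, positive, negative, margin=0.5):
    """Generic triplet margin loss (not the paper's exact formulation).

    The loss is zero once the anchor is more similar to the positive than to
    the negative by at least `margin`; otherwise it grows linearly.
    """
    gap = cosine_similarity(anchor, positive) - cosine_similarity(anchor, negative)
    return max(0.0, margin - gap)


# Toy embeddings, for illustration only.
anchor = [1.0, 0.0]
positive = [0.9, 0.1]        # same identity, different pose (assumed)
easy_negative = [0.0, 1.0]   # different identity, far away
hard_negative = [0.8, 0.3]   # different identity, close to the anchor

print(triplet_margin_loss(anchor, positive, easy_negative))  # 0.0
print(triplet_margin_loss(anchor, positive, hard_negative))
```

Only the hard negative contributes to the loss, which is the usual reason such objectives are paired with a strategy for mining difficult pairs.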

If this is right

  • Enables real-time detection of puppeteering attacks directly on the transmitted latent.
  • Outperforms prior puppeteering defenses across multiple talking-head generation models.
  • Generalizes to out-of-distribution driving identities and poses without retraining.
  • Operates without ever accessing or reconstructing the final RGB video frames.
  • Maintains low latency suitable for live videoconferencing pipelines.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same leakage principle may apply to other latent-based generative systems that transmit intermediate representations instead of full frames.
  • This approach could be combined with encryption of the latent stream to add a lightweight identity verification layer to existing video codecs.
  • Auditing generative latents for hidden attributes without reconstructing outputs may become a general technique for securing other synthesis pipelines.
  • If the contrastive training succeeds without RGB supervision, similar methods might extract other persistent attributes such as age or ethnicity from driving signals.

Load-bearing premise

Biometric identity information persists in the pose-expression latent independently of transient pose and expression variations and can be reliably isolated by a pose-conditioned contrastive encoder trained without direct access to reconstructed RGB frames.

What would settle it

An experiment that drives the same pose sequence with two different identities, extracts the identity embeddings, and checks whether their cosine similarity remains low; if the embeddings fail to separate reliably, the detection method collapses.
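
One way to score that experiment is sketched below, under stated assumptions: `embeddings_a[i]` and `embeddings_b[i]` are identity embeddings extracted from the same pose frame `i` driven by two different people, and `separation_gap` is a hypothetical helper name. The vectors are toy values, not data from the paper.

```python
import math
from itertools import combinations


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def mean(values):
    values = list(values)
    return sum(values) / len(values)


def separation_gap(embeddings_a, embeddings_b):
    """Mean within-identity similarity minus mean cross-identity similarity.

    Within-identity pairs compare different frames of the same driver;
    cross-identity pairs compare the two drivers at the same pose frame.
    """
    if len(embeddings_a) != len(embeddings_b) or len(embeddings_a) < 2:
        raise ValueError("need two pose-aligned sequences with at least two frames")
    within = mean(
        cosine_similarity(u, v)
        for sequence in (embeddings_a, embeddings_b)
        for u, v in combinations(sequence, 2)
    )
    across = mean(cosine_similarity(u, v) for u, v in zip(embeddings_a, embeddings_b))
    return within - across


# Toy embeddings for two drivers following the same three pose frames.
identity_a = [[1.0, 0.1], [1.0, -0.1], [0.9, 0.0]]
identity_b = [[0.1, 1.0], [-0.1, 1.0], [0.0, 0.9]]
gap = separation_gap(identity_a, identity_b)
print(gap)
```

A gap near zero on real data would indicate that the embedding does not separate identities once pose is held fixed, which is the failure case described above.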

Figures

Figures reproduced from arXiv: 2510.03548 by Danial Samadi Vahdati, David Luebke, Ekta Prashnani, Koki Nagano, Matthew Stamm, Orazio Gallo, Tai Duc Nguyen.

Figure 1: AI-based talking-head generators transmit only a compact pose-and-expression embedding
Figure 2: Illustration of three datasets (NVIDIA-VC [
Figure 3: Similarity distributions in P&E space (left) and biometric leakage space (right).
Figure 4: Detection AUC vs. window size and number of puppeteered identities during training.
Figure 5: Full-resolution version of Fig. 3 from the main paper. Similarity distributions in P&E space
Figure 6: Full-resolution version of Fig. 4 from the main paper: Overview of our loss function
Original abstract

AI-based talking-head videoconferencing systems reduce bandwidth by sending a compact pose-expression latent and re-synthesizing RGB at the receiver, but this latent can be puppeteered, letting an attacker hijack a victim's likeness in real time. Because every frame is synthetic, deepfake and synthetic video detectors fail outright. To address this security problem, we exploit a key observation: the pose-expression latent inherently contains biometric information of the driving identity. Therefore, we introduce the first biometric leakage defense without ever looking at the reconstructed RGB video: a pose-conditioned, large-margin contrastive encoder that isolates persistent identity cues inside the transmitted latent while cancelling transient pose and expression. A simple cosine test on this disentangled embedding flags illicit identity swaps as the video is rendered. Our experiments on multiple talking-head generation models show that our method consistently outperforms existing puppeteering defenses, operates in real-time, and shows strong generalization to out-of-distribution scenarios.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper claims that AI-based talking-head videoconferencing systems are vulnerable to real-time puppeteering attacks because the transmitted pose-expression latent can be manipulated to hijack a victim's likeness. It observes that this latent inherently leaks biometric information of the driving identity and introduces a pose-conditioned large-margin contrastive encoder to isolate persistent identity cues while cancelling transient pose and expression variations. Detection of illicit swaps is performed via a cosine similarity test on the resulting embedding, without ever accessing the reconstructed RGB video. Experiments are said to demonstrate consistent outperformance over existing defenses, real-time operation, and strong generalization to out-of-distribution cases across multiple talking-head models.

Significance. If the central claim holds, the work would be significant for addressing a practical security gap in real-time video communications where standard deepfake detectors fail on fully synthetic outputs. The direct use of the transmitted latent for biometric isolation, without RGB reconstruction, offers an efficient and deployable defense. Credit is given for the focus on real-time applicability and reported OOD generalization, which could inform future protocol designs if supported by rigorous validation.

major comments (3)
  1. [Abstract] Abstract: The claim of 'consistent outperformance on multiple talking-head models and generalization to out-of-distribution cases' is presented without any quantitative metrics, baselines, or experimental controls. This is load-bearing for the central claim, as it prevents assessment of whether the contrastive encoder reliably supports detection.
  2. [Method] Method section (encoder description): The pose-conditioned large-margin contrastive encoder is asserted to isolate persistent identity while cancelling transient pose/expression, but no explicit invariance verification (e.g., embedding variance under controlled pose changes for fixed identity) or ablation on the conditioning is provided. Without this, detected differences in cosine tests could arise from pose mismatch rather than identity swap, undermining the defense.
  3. [Experiments] Experiments section: The manuscript reports outperformance and real-time operation but omits specifics on training details, loss formulation, dataset splits, or numerical results (e.g., detection rates, false positives). This leaves the soundness of the biometric leakage exploitation unverified.
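
The invariance check requested in major comment 2 could take a form like the following sketch. It is an assumption about how such a study might be scored, not something the paper reports: `invariance_ratio` is a hypothetical helper, the identity labels and vectors are toy values, and the metric is a plain within-identity to between-identity scatter ratio.

```python
def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def invariance_ratio(embeddings_by_identity):
    """Within-identity scatter divided by between-identity scatter.

    `embeddings_by_identity` maps an identity label to the embeddings obtained
    while that identity cycles through controlled pose and expression changes.
    Values near zero mean pose changes barely move the embedding compared with
    the distance between identities.
    """
    centroids = {k: centroid(v) for k, v in embeddings_by_identity.items()}
    all_vectors = [e for v in embeddings_by_identity.values() for e in v]
    grand = centroid(all_vectors)
    total = len(all_vectors)
    within = sum(
        squared_distance(e, centroids[k])
        for k, v in embeddings_by_identity.items()
        for e in v
    ) / total
    between = sum(
        len(v) * squared_distance(centroids[k], grand)
        for k, v in embeddings_by_identity.items()
    ) / total
    if between == 0.0:
        raise ValueError("all identity centroids coincide; ratio is undefined")
    return within / between


# Toy embeddings: two identities, two poses each.
toy = {
    "alice": [[1.0, 0.0], [0.9, 0.1]],
    "bob": [[0.0, 1.0], [0.1, 0.9]],
}
ratio = invariance_ratio(toy)
print(ratio)
```

Reporting this ratio with and without the pose-conditioning input would address the ablation half of the same comment.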
minor comments (2)
  1. [Method] Clarify the precise integration of pose conditioning into the encoder (e.g., via concatenation or modulation) and the margin value in the contrastive loss for reproducibility.
  2. [Related Work] Add a reference to prior work on latent-space identity leakage in generative models to better contextualize novelty.
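
For readers unfamiliar with the two conditioning options named in minor comment 1, the sketch below contrasts them. Neither is confirmed as the paper's choice. The modulation variant follows the general FiLM pattern of a per-feature scale and shift; in a real encoder those would be predicted from the pose by a small learned network, which is omitted here. All function names and values are hypothetical.

```python
def condition_by_concatenation(latent, pose):
    """Append the pose vector to the latent so the next layer sees both."""
    return list(latent) + list(pose)


def condition_by_modulation(features, scale, shift):
    """FiLM-style modulation: apply a per-feature scale and shift.

    `scale` and `shift` stand in for the outputs of a pose-dependent network.
    """
    if not (len(features) == len(scale) == len(shift)):
        raise ValueError("features, scale and shift must have the same length")
    return [g * f + b for f, g, b in zip(features, scale, shift)]


# Toy values, for illustration only.
latent = [1.0, 2.0]
pose = [3.0]
print(condition_by_concatenation(latent, pose))                   # [1.0, 2.0, 3.0]
print(condition_by_modulation(latent, [2.0, 0.5], [0.0, 1.0]))    # [2.0, 2.0]
```

Concatenation changes the input width of the following layer, while modulation keeps the feature width fixed, which is one reason the distinction matters for reproducibility.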

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed review. The comments highlight important areas where additional clarity and evidence will strengthen the manuscript. We address each major comment below and commit to revisions that directly respond to the concerns raised.

Point-by-point responses
  1. Referee: [Abstract] Abstract: The claim of 'consistent outperformance on multiple talking-head models and generalization to out-of-distribution cases' is presented without any quantitative metrics, baselines, or experimental controls. This is load-bearing for the central claim, as it prevents assessment of whether the contrastive encoder reliably supports detection.

    Authors: We agree that the abstract would be more informative with key quantitative support for the performance claims. In the revised manuscript we will add concise numerical highlights drawn from the experiments (e.g., detection accuracy, FPR, and relative gains versus baselines) while preserving the abstract's brevity. This change will allow readers to evaluate the central claim without first consulting the full experimental section.
    revision: yes

  2. Referee: [Method] Method section (encoder description): The pose-conditioned large-margin contrastive encoder is asserted to isolate persistent identity while cancelling transient pose/expression, but no explicit invariance verification (e.g., embedding variance under controlled pose changes for fixed identity) or ablation on the conditioning is provided. Without this, detected differences in cosine tests could arise from pose mismatch rather than identity swap, undermining the defense.

    Authors: We acknowledge the value of explicit verification. The current manuscript describes the conditioning mechanism and loss but does not include a dedicated invariance study or conditioning ablation. We will add both: (1) quantitative embedding variance measurements for fixed identities across controlled pose/expression variations, and (2) an ablation removing the pose-conditioning input. These additions will directly demonstrate that identity separation is not driven by residual pose mismatch.
    revision: yes

  3. Referee: [Experiments] Experiments section: The manuscript reports outperformance and real-time operation but omits specifics on training details, loss formulation, dataset splits, or numerical results (e.g., detection rates, false positives). This leaves the soundness of the biometric leakage exploitation unverified.

    Authors: We agree that the experimental section requires greater specificity for reproducibility and verification. The revision will expand this section to include the precise training hyperparameters, the full contrastive loss formulation, dataset split details, and complete numerical results (detection rates, false-positive rates, latency measurements, and per-model comparisons). We will also add explicit controls and metrics for the out-of-distribution generalization experiments.
    revision: yes
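
The numerical results promised in these responses (detection rates, false-positive rates, AUC) are standard quantities that can be computed from per-call similarity scores. The sketch below shows one way to do so. The scores are made-up toy values, not results from the paper, and the helper names are hypothetical. A call is flagged as an attack when its cosine similarity falls below the threshold.

```python
def rates_at_threshold(genuine_scores, attack_scores, threshold):
    """Return (true-positive rate, false-positive rate) at one threshold.

    Scores are similarities, so a call is flagged when score < threshold.
    """
    tpr = sum(s < threshold for s in attack_scores) / len(attack_scores)
    fpr = sum(s < threshold for s in genuine_scores) / len(genuine_scores)
    return tpr, fpr


def auc(genuine_scores, attack_scores):
    """Probability that a random genuine score exceeds a random attack score.

    Ties count as half a win. This equals the area under the ROC curve.
    """
    wins = 0.0
    for g in genuine_scores:
        for a in attack_scores:
            if g > a:
                wins += 1.0
            elif g == a:
                wins += 0.5
    return wins / (len(genuine_scores) * len(attack_scores))


# Made-up similarity scores, for illustration only.
genuine = [0.9, 0.8, 0.4]
attack = [0.1, 0.3, 0.6]

tpr, fpr = rates_at_threshold(genuine, attack, threshold=0.5)
print(tpr, fpr)
print(auc(genuine, attack))
```

The pairwise AUC loop is quadratic in the number of scores, which is fine for a sketch but would be replaced by a rank-based computation on large evaluation sets.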

Circularity Check

0 steps flagged

No significant circularity; central defense relies on trained encoder from empirical observation

full rationale

The paper's derivation begins from the stated observation that the pose-expression latent contains biometric identity information and proceeds by introducing a new pose-conditioned large-margin contrastive encoder trained to isolate persistent cues while cancelling transients. This is not an algebraic reduction, fitted-input prediction, or self-citation chain that collapses back to the inputs by construction; the encoder is presented as an independently trained component whose parameters are learned rather than predefined from the target result. No equations or prior-author uniqueness theorems are invoked in the provided description to force the outcome, and the method is self-contained as a machine-learned detector evaluated on multiple models.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Information is limited to the abstract; no explicit free parameters, invented physical entities, or additional axioms beyond the core domain assumption are stated. The contrastive encoder itself is a methodological contribution rather than a new postulated entity.

axioms (1)
  • Domain assumption: the pose-expression latent inherently contains biometric information of the driving identity that persists independently of pose and expression.
    Directly stated as the key observation enabling the defense in the abstract.

Reference graph

Works this paper leans on

72 extracted references · 72 canonical work pages

  1. [1]

    Synthesizing realistic avatars for enhanced virtual communication.IEEE Transactions on Visualization and Computer Graphics, 29(5):1802–1815, 2023

    Hao Wang and Li Zhang. Synthesizing realistic avatars for enhanced virtual communication.IEEE Transactions on Visualization and Computer Graphics, 29(5):1802–1815, 2023. doi: 10.1109/TVCG.2023. 3214567. URLhttps://ieeexplore.ieee.org/document/3214567

  2. [2]

    Generative adversarial networks for hyper-realistic avatar creation

    Min-Jun Lee and Soo-Young Kim. Generative adversarial networks for hyper-realistic avatar creation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 1234–1243, 2022. doi: 10.1109/CVPR.2022.001234. URL https://ieeexplore.ieee.org/ document/001234

  3. [3]

    Deep learning techniques for avatar-based interaction in virtual environments

    John Smith and Jane Doe. Deep learning techniques for avatar-based interaction in virtual environments. IEEE Transactions on Neural Networks and Learning Systems, 32(12):5600–5612, 2021. doi: 10.1109/ TNNLS.2021.3071234. URLhttps://ieeexplore.ieee.org/document/3071234

  4. [4]

    Ai-mediated 3d video conferencing

    Michael Stengel, Koki Nagano, Chao Liu, Matthew Chan, Alex Trevithick, Shalini De Mello, Jonghyun Kim, and David Luebke. Ai-mediated 3d video conferencing. InACM SIGGRAPH Emerging Technologies,

  5. [5]

    doi: 10.1145/3588037.3595385

  6. [6]

    Shwetha Rajaram, Nels Numan, Balasaravanan Thoravi Kumaravel, Nicolai Marquardt, and Andrew D. Wilson. Blendscape: Enabling unified and personalized video-conferencing environments through genera- tive ai.arXiv preprint arXiv:2403.13947, 2024

  7. [7]

    Faivconf: Face enhancement for ai-based video conference with low bit-rate.arXiv preprint arXiv:2207.04090, 2022

    Zhengang Li, Sheng Lin, Shan Liu, Songnan Li, Xue Lin, Wei Wang, and Wei Jiang. Faivconf: Face enhancement for ai-based video conference with low bit-rate.arXiv preprint arXiv:2207.04090, 2022

  8. [8]

    Multimodal active speaker detection and virtual cinematography for video conferencing.arXiv preprint arXiv:2002.03977, 2020

    Ross Cutler, Ramin Mehran, Sam Johnson, Cha Zhang, Adam Kirk, Oliver Whyte, and Adarsh Kowdle. Multimodal active speaker detection and virtual cinematography for video conferencing.arXiv preprint arXiv:2002.03977, 2020. 10

  9. [9]

    In: CVPR

    Ting-Chun Wang, Arun Mallya, and Ming-Yu Liu. One-shot free-view neural talking-head synthesis for video conferencing. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10039–10049, 2021. doi: 10.1109/CVPR46437.2021.00991

  10. [10]

    Multimodal semantic communication for generative audio-driven video conferencing.arXiv preprint arXiv:2410.22112, 2024

    Haonan Tong, Haopeng Li, Hongyang Du, Zhaohui Yang, Changchuan Yin, and Dusit Niyato. Multimodal semantic communication for generative audio-driven video conferencing.arXiv preprint arXiv:2410.22112, 2024

  11. [11]

    Defending low-bandwidth talking head videoconferencing systems from real-time puppeteering attacks

    Danial Samadi Vahdati, Tai Duc Nguyen, and Matthew C Stamm. Defending low-bandwidth talking head videoconferencing systems from real-time puppeteering attacks. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 983–992, 2023

  12. [12]

    Avatar fingerprinting for authorized use of synthetic talking-head videos

    Ekta Prashnani, Koki Nagano, Shalini De Mello, David Luebke, and Orazio Gallo. Avatar fingerprinting for authorized use of synthetic talking-head videos. InEuropean Conference on Computer Vision, pages 209–228. Springer, 2024

  13. [13]

    Combining efficientnet and vision transformers for video deepfake detection

    Davide Alessandro Coccomini, Nicola Messina, Claudio Gennaro, and Fabrizio Falchi. Combining efficientnet and vision transformers for video deepfake detection. InInternational conference on image analysis and processing, pages 219–229. Springer, 2022

  14. [14]

    Video face manipulation detection through ensemble of cnns

    Nicolo Bonettini, Edoardo Daniele Cannas, Sara Mandelli, Luca Bondi, Paolo Bestagini, and Stefano Tubaro. Video face manipulation detection through ensemble of cnns. In2020 25th international conference on pattern recognition (ICPR), pages 5012–5019. IEEE, 2021

  15. [15]

    Tall: Thumbnail layout for deepfake video detection

    Yuting Xu, Jian Liang, Gengyun Jia, Ziming Yang, Yanhao Zhang, and Ran He. Tall: Thumbnail layout for deepfake video detection. InProceedings of the IEEE/CVF international conference on computer vision, pages 22658–22668, 2023

  16. [16]

    Supervised contrastive learning for generalizable and explainable deepfakes detection

    Ying Xu, Kiran Raja, and Marius Pedersen. Supervised contrastive learning for generalizable and explainable deepfakes detection. InProceedings of the IEEE/CVF winter conference on applications of computer vision, pages 379–389, 2022

  17. [17]

    Laa-net: Localized artifact attention network for quality-agnostic and generalizable deepfake detection

    Dat Nguyen, Nesryne Mejri, Inder Pal Singh, Polina Kuleshova, Marcella Astrid, Anis Kacem, Enjie Ghorbel, and Djamila Aouada. Laa-net: Localized artifact attention network for quality-agnostic and generalizable deepfake detection. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 17395–17405, 2024

  18. [18]

    Videofact: detecting video forgeries using attention, scene context, and forensic traces

    Tai D Nguyen, Shengbang Fang, and Matthew C Stamm. Videofact: detecting video forgeries using attention, scene context, and forensic traces. InProceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 8563–8573, 2024

  19. [19]

    Nguyen, Aref Azizpour, and Matthew C

    Danial Samadi Vahdati, Tai D. Nguyen, Aref Azizpour, and Matthew C. Stamm. Beyond deepfake images: Detecting ai-generated videos.arXiv preprint arXiv:2404.15955, 2024

  20. [20]

    Distinguish any fake videos: Unleashing the power of large-scale data and motion features.arXiv preprint arXiv:2405.15343, 2024

    Lichuan Ji, Yingqi Lin, Zhenhua Huang, Yan Han, Xiaogang Xu, Jiafei Wu, Chong Wang, and Zhe Liu. Distinguish any fake videos: Unleashing the power of large-scale data and motion features.arXiv preprint arXiv:2405.15343, 2024

  21. [21]

    What matters in detecting ai-generated videos like sora?

    Chirui Chang, Zhengzhe Liu, Xiaoyang Lyu, and Xiaojuan Qi. What matters in detecting ai-generated videos like sora?arXiv preprint arXiv:2406.19568, 2024

  22. [22]

    Towards a universal synthetic video detector: From face or background manipulations to fully ai-generated content.arXiv preprint arXiv:2412.12278, 2024

    Yuezun Li, Ming-Ching Chang, and Siwei Lyu. Towards a universal synthetic video detector: From face or background manipulations to fully ai-generated content.arXiv preprint arXiv:2412.12278, 2024

  23. [23]

    Generalizable and animatable gaussian head avatar.Advances in Neural Information Processing Systems, 37:57642–57670, 2024

    Xuangeng Chu and Tatsuya Harada. Generalizable and animatable gaussian head avatar.Advances in Neural Information Processing Systems, 37:57642–57670, 2024

  24. [24]

    Emoca: Emotion driven monocular face capture and animation

    Radek Danˇeˇcek, Michael J Black, and Timo Bolkart. Emoca: Emotion driven monocular face capture and animation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20311–20322, 2022

  25. [25]

    Audio2head: Audio-driven one-shot talking-head generation with natural head motion

    Suzhen Wang, Lincheng Li, Yu Ding, Changjie Fan, and Xin Yu. Audio2head: Audio-driven one-shot talking-head generation with natural head motion. InProceedings of the 30th International Joint Conference on Artificial Intelligence (IJCAI), pages 1107–1113, 2021. doi: 10.24963/ijcai.2021/152

  26. [26]

    Emotalker: Audio driven emotion aware talking head generation

    Xiaoqian Shen, Yantong Wang, Zhenhua Liu, and Zhiyong Wang. Emotalker: Audio driven emotion aware talking head generation. InProceedings of the 17th Asian Conference on Computer Vision (ACCV), pages 123–137, 2024. doi: 10.1007/978-3-031-68418-0_9. 11

  27. [27]

    Expressive talking head video encoding in stylegan2 latent space.arXiv preprint arXiv:2203.14512, 2022

    Trevine Oorloff and Yaser Yacoob. Expressive talking head video encoding in stylegan2 latent space.arXiv preprint arXiv:2203.14512, 2022

  28. [28]

    Talking-head generation with rhythmic head motion

    Lele Chen, Guofeng Cui, Celong Liu, Zhong Li, Ziyi Kou, Yi Xu, and Chenliang Xu. Talking-head generation with rhythmic head motion. InProceedings of the European Conference on Computer Vision (ECCV), pages 35–51, 2020

  29. [29]

    Namboodiri, and C.V

    Madhav Agarwal, Anchit Gupta, Rudrabha Mukhopadhyay, Vinay P. Namboodiri, and C.V . Jawahar. Compressing video calls using synthetic talking heads. InProceedings of the British Machine Vision Conference (BMVC), pages 1–12, 2021

  30. [30]

    Depth-aware generative adversarial network for talking head video generation

    Fa-Ting Hong, Longhao Zhang, Li Shen, and Dan Xu. Depth-aware generative adversarial network for talking head video generation. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 3397–3406, 2022

  31. [31]

    Dagan++: Depth-aware generative adversarial network for talking head video generation.IEEE Transactions on Pattern Analysis and Machine Intelligence, 46(5):2997–3012, 2023

    Fa-Ting Hong, Li Shen, and Dan Xu. Dagan++: Depth-aware generative adversarial network for talking head video generation.IEEE Transactions on Pattern Analysis and Machine Intelligence, 46(5):2997–3012, 2023

  32. [32]

    Implicit warping for animation with image sets

    Arun Mallya, Ting-Chun Wang, and Ming-Yu Liu. Implicit warping for animation with image sets. Advances in Neural Information Processing Systems, 35:22438–22450, 2022

  33. [33]

    Anifacegan: Animatable 3d-aware face image generation for video avatars.Advances in Neural Information Processing Systems, 35:36188–36201, 2022

    Yue Wu, Yu Deng, Jiaolong Yang, Fangyun Wei, Qifeng Chen, and Xin Tong. Anifacegan: Animatable 3d-aware face image generation for video avatars.Advances in Neural Information Processing Systems, 35:36188–36201, 2022

  34. [34]

    Talking face generation with multilingual tts

    Hyoung-Kyu Song, Sang Hoon Woo, Junhyeok Lee, Seungmin Yang, Hyunjae Cho, Youseong Lee, Dongho Choi, and Kang-wook Kim. Talking face generation with multilingual tts. InProceedings of the ieee/cvf conference on computer vision and pattern recognition, pages 21425–21430, 2022

  35. [35]

    High-fidelity and freely controllable talking head video generation

    Yue Gao, Yuan Zhou, Jinglu Wang, Xiao Li, Xiang Ming, and Yan Lu. High-fidelity and freely controllable talking head video generation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10039–10049, 2023. doi: 10.1109/CVPR2023.00991

  36. [36]

    Discohead: Audio-and-video-driven talking head generation by disentangled control of head pose and facial expressions

    Geumbyeol Hwang, Sunwon Hong, Seunghyun Lee, Sungwoo Park, and Gyeongsu Chae. Discohead: Audio-and-video-driven talking head generation by disentangled control of head pose and facial expressions. arXiv preprint arXiv:2303.07697, 2023

  37. [37]

    Emotivetalk: Expressive talking head generation through audio information decoupling and emotional video diffusion.arXiv preprint arXiv:2411.16726, 2024

    Haotian Wang, Yuzhe Weng, Yueyan Li, Zilu Guo, Jun Du, Shutong Niu, Jiefeng Ma, Shan He, Xiaoyan Wu, Qiming Hu, Bing Yin, Cong Liu, and Qingfeng Liu. Emotivetalk: Expressive talking head generation through audio information decoupling and emotional video diffusion.arXiv preprint arXiv:2411.16726, 2024

  38. [38]

    Interactive conversational head generation.arXiv preprint arXiv:2307.02090, 2023

    Haotian Wang, Yuzhe Weng, Yueyan Li, Zilu Guo, Jun Du, Shutong Niu, Jiefeng Ma, Shan He, Xiaoyan Wu, Qiming Hu, Bing Yin, Cong Liu, and Qingfeng Liu. Interactive conversational head generation.arXiv preprint arXiv:2307.02090, 2023

  39. [39]

    R2-talker: Realistic real-time talking head synthesis with hash grid landmarks encoding and progressive multilayer conditioning.arXiv preprint arXiv:2312.05572, 2023

    Zhiling Ye, LiangGuo Zhang, Dingheng Zeng, Quan Lu, and Ning Jiang. R2-talker: Realistic real-time talking head synthesis with hash grid landmarks encoding and progressive multilayer conditioning.arXiv preprint arXiv:2312.05572, 2023

  40. [40]

    Efficient conditioned face animation using frontally-viewed embedding.arXiv preprint arXiv:2203.08765, 2022

    Maxime Oquab, Daniel Haziza, Ludovic Schwartz, Tao Xu, Katayoun Zand, Rui Wang, Peirong Liu, and Camille Couprie. Efficient conditioned face animation using frontally-viewed embedding.arXiv preprint arXiv:2203.08765, 2022

  41. [41]

    arXiv preprint arXiv:2301.12345 , year =

    John Doe, Jane Smith, and Alan Turing. Txt2vid: Ultra-low bitrate compression of talking-head videos via text.arXiv preprint arXiv:2301.12345, 2023

  42. [42]

    Goldman, Supreeth Achar, Gregory Major Blascovich, Joseph G

    Jason Lawrence, Dan B. Goldman, Supreeth Achar, Gregory Major Blascovich, Joseph G. Desloge, Tommy Fortes, Eric M. Gomez, Sascha Häberling, Hugues Hoppe, Andy Huibers, Claude Knaus, Brian Kuschak, Ricardo Martin-Brualla, Harris Nover, Andrew Ian Russell, Steven M. Seitz, and Kevin Tong. Project starline: A high-fidelity telepresence system.ACM Transaction...

  43. [43]

    Fakeout: Leveraging out-of-domain self-supervision for multi-modal video deepfake detection.arXiv preprint arXiv:2212.00773, 2022

    Gil Knafo and Ohad Fried. Fakeout: Leveraging out-of-domain self-supervision for multi-modal video deepfake detection.arXiv preprint arXiv:2212.00773, 2022. 12

  44. [44]

    Emerging properties in self-supervised vision transformers

    Shu Hu, Yuezun Li, and Siwei Lyu. Exposing gan-generated faces using inconsistent corneal specular highlights. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 1506–1515, 2021. doi: 10.1109/ICCV48922.2021.00154

  45. [45]

    Detecting deep-fake videos from phoneme- viseme mismatches

    Yuezun Li, Xin Yang, Pu Sun, Honggang Qi, and Siwei Lyu. Detecting deep-fake videos from phoneme- viseme mismatches. InProceedings of the 27th ACM International Conference on Multimedia (MM), pages 1136–1145, 2018. doi: 10.1145/3343031.3350928

  46. [46]

    Deeprhythm: Exposing deepfakes with attentional visual heartbeat rhythms

    Yuezun Li, Ming-Ching Chang, and Siwei Lyu. Deeprhythm: Exposing deepfakes with attentional visual heartbeat rhythms. InProceedings of the 28th ACM International Conference on Multimedia (MM), pages 1411–1419, 2020. doi: 10.1145/3394171.3413651

  47. [47]

    Emerging properties in self-supervised vision transformers

    Ilke Demir, Umur Aybars Ciftci, and Lijun Yin. Fakecatcher: Detection of synthetic portrait videos using biological signals. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 1432–1441, 2021. doi: 10.1109/ICCV48922.2021.00147

  48. [48]

    Capturing the lighting inconsistency for deepfake detection

    Wei-Ming Zhang, Xue-Jie Zhang, and Yu-Feng Li. Capturing the lighting inconsistency for deepfake detection. InProceedings of the European Conference on Computer Vision (ECCV), pages 812–828, 2022. doi: 10.1007/978-3-031-06788-4_52

  49. [49]

    Illumination enlightened spatial-temporal inconsistency for deepfake video detection

    Wei-Ming Zhang, Xue-Jie Zhang, and Yu-Feng Li. Illumination enlightened spatial-temporal inconsistency for deepfake video detection. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 12345–12354, 2023. doi: 10.1109/CVPR2023.01234

  50. [50]

    Lideepdet: Deepfake detection via image decomposition and inconsistency analysis.Electronics, 13(22):4466, 2023

    Wei-Ming Zhang, Xue-Jie Zhang, and Yu-Feng Li. Lideepdet: Deepfake detection via image decomposition and inconsistency analysis.Electronics, 13(22):4466, 2023. doi: 10.3390/electronics13224466

  51. [51]

    Deep- fake detection by exploiting surface anomalies: The surfake approach.arXiv preprint arXiv:2310.20621, 2023

    Andrea Ciamarra, Roberto Caldelli, Federico Becattini, Lorenzo Seidenari, and Alberto Del Bimbo. Deep- fake detection by exploiting surface anomalies: The surfake approach.arXiv preprint arXiv:2310.20621, 2023

  52. [52]

    Aim-bone: Texture discrepancy generation and localization for generalized deepfake detection

    Zongmei Chen, Xin Liao, and Xiaoshuai Wu. Aim-bone: Texture discrepancy generation and localization for generalized deepfake detection. IEEE Transactions on Multimedia, 25:1–13, 2023. doi: 10.1109/TMM.2023.3245678

  53. [53]

    Deepfake detection and localization using multi-view inconsistency measurement

    Zhiwei Xiong, Wei Wang, and Xiaochun Cao. Deepfake detection and localization using multi-view inconsistency measurement. IEEE Transactions on Multimedia, 25:1–12, 2023. doi: 10.1109/TMM.2023.3234567

  54. [54]

    Artifacts-disentangled adversarial learning for deepfake detection

    Haodong Li, Bin Li, and Shunquan Tan. Artifacts-disentangled adversarial learning for deepfake detection. IEEE Transactions on Information Forensics and Security, 17:1–14, 2022. doi: 10.1109/TIFS.2022.3142356

  55. [55]

    Steven R Livingstone and Frank A Russo. The ryerson audio-visual database of emotional speech and song (ravdess): A dynamic, multimodal set of facial and vocal expressions in north american english. PloS one, 13(5):e0196391, 2018

  56. [56]

    Crema-d: Crowd-sourced emotional multimodal actors dataset

    Houwei Cao, David G Cooper, Michael K Keutmann, Ruben C Gur, Ani Nenkova, and Ragini Verma. Crema-d: Crowd-sourced emotional multimodal actors dataset. IEEE transactions on affective computing, 5(4):377–390, 2014

  57. [57]

    Synergizing motion and appearance: Multi-scale compensatory codebooks for talking head video generation

    Shuling Zhao, Fa-Ting Hong, Xiaoshui Huang, and Dan Xu. Synergizing motion and appearance: Multi-scale compensatory codebooks for talking head video generation. arXiv preprint arXiv:2412.00719, 2024

  58. [58]

    Say anything with any style

    Shuai Tan, Bin Ji, Yu Ding, and Ye Pan. Say anything with any style. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 5088–5096, 2024

  59. [59]

    Cvthead: One-shot controllable head avatar with vertex-feature transformer

    Haoyu Ma, Tong Zhang, Shanlin Sun, Xiangyi Yan, Kun Han, and Xiaohui Xie. Cvthead: One-shot controllable head avatar with vertex-feature transformer. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 6131–6141, 2024

  60. [60]

    Learning dynamic facial radiance fields for few-shot talking head synthesis

    Shuai Shen, Wanhua Li, Zheng Zhu, Yueqi Duan, Jie Zhou, and Jiwen Lu. Learning dynamic facial radiance fields for few-shot talking head synthesis. In European conference on computer vision, pages 666–682. Springer, 2022

  61. [61]

    Sadtalker: Learning realistic 3d motion coefficients for stylized audio-driven single image talking face animation

    Wenxuan Zhang, Xiaodong Cun, Xuan Wang, Yong Zhang, Xi Shen, Yu Guo, Ying Shan, and Fei Wang. Sadtalker: Learning realistic 3d motion coefficients for stylized audio-driven single image talking face animation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8652–8661, 2023.

  62. [62]

    Latent image animator: Learning to animate images via latent space navigation

    Yaohui Wang, Di Yang, Francois Bremond, and Antitza Dantcheva. Latent image animator: Learning to animate images via latent space navigation. arXiv preprint arXiv:2203.09043, 2022

  63. [63]

    Audio-visual face reenactment

    Madhav Agarwal, Rudrabha Mukhopadhyay, Vinay P Namboodiri, and CV Jawahar. Audio-visual face reenactment. In Proceedings of the IEEE/CVF winter conference on applications of computer vision, pages 5178–5187, 2023

  64. [64]

    Cosface: Large margin cosine loss for deep face recognition

    Hao Wang, Yitong Wang, Zheng Zhou, Xing Ji, Dihong Gong, Jingchao Zhou, Zhifeng Li, and Wei Liu. Cosface: Large margin cosine loss for deep face recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5265–5274, 2018

  65. [65]

    Adaface: Quality adaptive margin for face recognition

    Minchul Kim, Anil K Jain, and Xiaoming Liu. Adaface: Quality adaptive margin for face recognition. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 18750–18759, 2022

  66. [66]

    Generalization analysis for contrastive representation learning

    Yunwen Lei, Tianbao Yang, Yiming Ying, and Ding-Xuan Zhou. Generalization analysis for contrastive representation learning. In International Conference on Machine Learning, pages 19200–19227. PMLR, 2023

  67. [67]

    Understanding contrastive learning through the lens of margins

    Daniel Rho, TaeSoo Kim, Sooill Park, Jaehyun Park, and JaeHan Park. Understanding contrastive learning through the lens of margins. arXiv preprint arXiv:2306.11526, 2023

  68. [68]

    3dfaceshop: Explicitly controllable 3d-aware portrait generation

    Junshu Tang, Bo Zhang, Binxin Yang, Ting Zhang, Dong Chen, Lizhuang Ma, and Fang Wen. 3dfaceshop: Explicitly controllable 3d-aware portrait generation. IEEE transactions on visualization and computer graphics, 30(9):6020–6037, 2023

  69. [69]

    Implicit identity representation conditioned memory compensation network for talking head video generation

    Fa-Ting Hong and Dan Xu. Implicit identity representation conditioned memory compensation network for talking head video generation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 23062–23072, 2023

  70. [70]

    Emoportraits: Emotion-enhanced multimodal one-shot head avatars

    Nikita Drobyshev, Antoni Bigata Casademunt, Konstantinos Vougioukas, Zoe Landgraf, Stavros Petridis, and Maja Pantic. Emoportraits: Emotion-enhanced multimodal one-shot head avatars. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8498–8507, 2024

  71. [71]

    Finding directions in gan’s latent space for neural face reenactment

    Stella Bounareli, Vasileios Argyriou, and Georgios Tzimiropoulos. Finding directions in gan’s latent space for neural face reenactment. arXiv preprint arXiv:2202.00046, 2022

  72. [72]

    Liveportrait: Efficient portrait animation with stitching and retargeting control

    Jianzhu Guo, Dingyun Zhang, Xiaoqiang Liu, Zhizhou Zhong, Yuan Zhang, Pengfei Wan, and Di Zhang. Liveportrait: Efficient portrait animation with stitching and retargeting control. arXiv preprint arXiv:2407.03168, 2024.

Unmasking Puppeteers: Leveraging Biometric Leakage to Disarm Impersonation in AI-based Videoconferencing Supplementary Material

A. Addit...