Unmasking Puppeteers: Leveraging Biometric Leakage to Expose Impersonation in AI-Based Videoconferencing

Danial Samadi Vahdati; David Luebke; Ekta Prashnani; Koki Nagano; Matthew Stamm; Orazio Gallo; Tai Duc Nguyen

arxiv: 2510.03548 · v4 · submitted 2025-10-03 · 💻 cs.CV · cs.AI

Pith reviewed 2026-05-18 09:53 UTC · model grok-4.3

keywords: biometric leakage, puppeteering defense, talking-head synthesis, contrastive encoder, identity detection, latent space security, videoconferencing, AI video defense

The pith

The pose-expression latent in AI talking-head videoconferencing carries persistent biometric identity information that can be isolated to detect impersonation without inspecting the output video.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

AI videoconferencing systems transmit a compact latent that is meant to encode only pose and expression, and the receiver reconstructs the video from it. An attacker can puppeteer this latent to animate a victim's likeness in real time. Standard deepfake detectors fail because every rendered frame is synthetic. The paper establishes that the latent nevertheless retains stable biometric cues about whoever is driving it. It introduces a pose-conditioned contrastive encoder trained to extract those identity cues while discarding pose and expression changes. A cosine similarity check on the resulting embedding then flags when the driver identity has been swapped.

Core claim

We observe that the pose-expression latent inherently contains biometric information of the driving identity. We introduce a pose-conditioned, large-margin contrastive encoder that isolates persistent identity cues inside the transmitted latent while cancelling transient pose and expression. A simple cosine test on this disentangled embedding flags illicit identity swaps as the video is rendered.
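
As a rough illustration of the final step, the cosine test might look like the sketch below. This is not the paper's implementation: the function names, the idea of comparing against a stored reference embedding for the expected identity, and the 0.5 threshold are all assumptions made for the example, and the embeddings are assumed to come from an encoder of the kind described above.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)

def is_puppeteered(driver_embedding, reference_embedding, threshold=0.5):
    """Flag a swap when similarity drops below the threshold.

    The threshold value is hypothetical; in practice it would be tuned
    on held-out data.
    """
    return cosine_similarity(driver_embedding, reference_embedding) < threshold
```

A deployed system would choose the threshold from a validation set rather than fixing it by hand.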

What carries the argument

A pose-conditioned, large-margin contrastive encoder that isolates persistent identity cues inside the transmitted pose-expression latent while suppressing transient pose and expression variations.
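
The paper's exact loss is not reproduced on this page. For orientation only, a generic margin-based contrastive objective on cosine similarity has the form below. Every symbol is an assumption introduced for illustration: $f_\theta$ is the encoder, $\ell_i$ a transmitted latent, $p_i$ its pose-conditioning input, $y_{ij} = 1$ when two samples share a driving identity (and $0$ otherwise), and $m$ the margin.

```latex
\begin{aligned}
z_i &= f_\theta(\ell_i, p_i), \qquad
s_{ij} = \frac{z_i^\top z_j}{\lVert z_i \rVert \, \lVert z_j \rVert}, \\
\mathcal{L}_{ij} &= y_{ij}\,(1 - s_{ij}) + (1 - y_{ij})\,\max(0,\; s_{ij} - m).
\end{aligned}
```

In this form, same-identity pairs are pulled toward similarity 1, while different-identity pairs are penalized only when their similarity exceeds the margin $m$.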

If this is right

Enables real-time detection of puppeteering attacks directly on the transmitted latent.
Outperforms prior puppeteering defenses across multiple talking-head generation models.
Generalizes to out-of-distribution driving identities and poses without retraining.
Operates without ever accessing or reconstructing the final RGB video frames.
Maintains low latency suitable for live videoconferencing pipelines.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same leakage principle may apply to other latent-based generative systems that transmit intermediate representations instead of full frames.
This approach could be combined with encryption of the latent stream to add a lightweight identity verification layer to existing video codecs.
Auditing generative latents for hidden attributes without reconstructing outputs may become a general technique for securing other synthesis pipelines.
If the contrastive training succeeds without RGB supervision, similar methods might extract other persistent attributes such as age or ethnicity from driving signals.

Load-bearing premise

Biometric identity information persists in the pose-expression latent independently of transient pose and expression variations and can be reliably isolated by a pose-conditioned contrastive encoder trained without direct access to reconstructed RGB frames.

What would settle it

An experiment that drives the same pose sequence with two different identities, extracts the identity embeddings, and checks whether their cosine similarity remains low; if the embeddings fail to separate reliably, the detection method collapses.
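
One way to score such an experiment is to collect cosine similarities for same-identity pairs and for swapped-identity pairs, then compute the area under the ROC curve. The sketch below is a plain pairwise AUC with a hypothetical function name and made-up inputs; it is not taken from the paper. A value near 1 would mean the embeddings separate identities, and a value near 0.5 would mean the separation has collapsed.

```python
def roc_auc(same_identity_scores, swapped_identity_scores):
    """Probability that a same-identity pair outscores a swapped pair.

    Ties count as half. This is the pairwise (Mann-Whitney) form of AUC.
    """
    if not same_identity_scores or not swapped_identity_scores:
        raise ValueError("both score lists must be non-empty")
    wins = 0.0
    for same in same_identity_scores:
        for swapped in swapped_identity_scores:
            if same > swapped:
                wins += 1.0
            elif same == swapped:
                wins += 0.5
    return wins / (len(same_identity_scores) * len(swapped_identity_scores))
```

The double loop is quadratic in the number of scores, which is acceptable for a small controlled experiment; a rank-based implementation would be preferable at scale.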

Figures

Figures reproduced from arXiv: 2510.03548 by Danial Samadi Vahdati, David Luebke, Ekta Prashnani, Koki Nagano, Matthew Stamm, Orazio Gallo, Tai Duc Nguyen.

**Figure 2.** Illustration of three datasets (NVIDIA-VC [ …

**Figure 3.** Similarity distributions in P&E space (left) and biometric leakage space (right).

**Figure 4.** Detection AUC vs. window size and number of puppeteered identities during training.

**Figure 5.** Full-resolution version of Fig. 3 from the main paper. Similarity distributions in P&E space …

**Figure 6.** Full-resolution version of Fig. 4 from the main paper: Overview of our loss function

Original abstract

AI-based talking-head videoconferencing systems reduce bandwidth by sending a compact pose-expression latent and re-synthesizing RGB at the receiver, but this latent can be puppeteered, letting an attacker hijack a victim's likeness in real time. Because every frame is synthetic, deepfake and synthetic video detectors fail outright. To address this security problem, we exploit a key observation: the pose-expression latent inherently contains biometric information of the driving identity. Therefore, we introduce the first biometric leakage defense without ever looking at the reconstructed RGB video: a pose-conditioned, large-margin contrastive encoder that isolates persistent identity cues inside the transmitted latent while cancelling transient pose and expression. A simple cosine test on this disentangled embedding flags illicit identity swaps as the video is rendered. Our experiments on multiple talking-head generation models show that our method consistently outperforms existing puppeteering defenses, operates in real-time, and shows strong generalization to out-of-distribution scenarios.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper shows how to detect puppeteering in latent-based video calls by training a pose-conditioned contrastive encoder to pull biometric identity cues out of the transmitted latent.

The main thing here is a defense that flags identity swaps in AI videoconferencing without ever looking at the output frames. They start from the observation that the pose-expression latent carries persistent biometric information about the driver. From there they train a pose-conditioned large-margin contrastive encoder to isolate those identity signals while suppressing the changing pose and expression, then use a cosine test on the resulting embedding for real-time detection. This is presented as the first approach that works directly on the latent rather than on reconstructed RGB, which is a reasonable response to the fact that standard synthetic-video detectors fail when the entire output is generated.

Referee Report

3 major / 2 minor

Summary. The paper claims that AI-based talking-head videoconferencing systems are vulnerable to real-time puppeteering attacks because the transmitted pose-expression latent can be manipulated to hijack a victim's likeness. It observes that this latent inherently leaks biometric information of the driving identity and introduces a pose-conditioned large-margin contrastive encoder to isolate persistent identity cues while cancelling transient pose and expression variations. Detection of illicit swaps is performed via a cosine similarity test on the resulting embedding, without ever accessing the reconstructed RGB video. Experiments are said to demonstrate consistent outperformance over existing defenses, real-time operation, and strong generalization to out-of-distribution cases across multiple talking-head models.

Significance. If the central claim holds, the work would be significant for addressing a practical security gap in real-time video communications where standard deepfake detectors fail on fully synthetic outputs. The direct use of the transmitted latent for biometric isolation, without RGB reconstruction, offers an efficient and deployable defense. Credit is given for the focus on real-time applicability and reported OOD generalization, which could inform future protocol designs if supported by rigorous validation.

major comments (3)

[Abstract] Abstract: The claim of 'consistent outperformance on multiple talking-head models and generalization to out-of-distribution cases' is presented without any quantitative metrics, baselines, or experimental controls. This is load-bearing for the central claim, as it prevents assessment of whether the contrastive encoder reliably supports detection.
[Method] Method section (encoder description): The pose-conditioned large-margin contrastive encoder is asserted to isolate persistent identity while cancelling transient pose/expression, but no explicit invariance verification (e.g., embedding variance under controlled pose changes for fixed identity) or ablation on the conditioning is provided. Without this, detected differences in cosine tests could arise from pose mismatch rather than identity swap, undermining the defense.
[Experiments] Experiments section: The manuscript reports outperformance and real-time operation but omits specifics on training details, loss formulation, dataset splits, or numerical results (e.g., detection rates, false positives). This leaves the soundness of the biometric leakage exploitation unverified.
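
The invariance check requested in the second comment could be approximated by comparing similarities within an identity across poses against similarities between identities. The sketch below is only an illustration of that comparison: the function names and the dictionary layout are hypothetical, and the embeddings are assumed to be plain lists of floats produced by the encoder.

```python
import math
from itertools import combinations

def cosine(a, b):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def intra_inter_similarity(embeddings_by_identity):
    """Return (mean intra-identity, mean inter-identity) cosine similarity.

    embeddings_by_identity maps an identity label to a list of embeddings,
    one per pose or expression sample of that identity.
    """
    identities = list(embeddings_by_identity)
    intra = [cosine(a, b)
             for ident in identities
             for a, b in combinations(embeddings_by_identity[ident], 2)]
    inter = [cosine(a, b)
             for id_a, id_b in combinations(identities, 2)
             for a in embeddings_by_identity[id_a]
             for b in embeddings_by_identity[id_b]]
    if not intra or not inter:
        raise ValueError("need two or more identities and repeated samples per identity")
    return sum(intra) / len(intra), sum(inter) / len(inter)
```

A high intra-identity mean together with a low inter-identity mean would be consistent with pose invariance; a small gap would suggest that pose mismatch still drives the embedding.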

minor comments (2)

[Method] Clarify the precise integration of pose conditioning into the encoder (e.g., via concatenation or modulation) and the margin value in the contrastive loss for reproducibility.
[Related Work] Add a reference to prior work on latent-space identity leakage in generative models to better contextualize novelty.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed review. The comments highlight important areas where additional clarity and evidence will strengthen the manuscript. We address each major comment below and commit to revisions that directly respond to the concerns raised.

Referee: [Abstract] Abstract: The claim of 'consistent outperformance on multiple talking-head models and generalization to out-of-distribution cases' is presented without any quantitative metrics, baselines, or experimental controls. This is load-bearing for the central claim, as it prevents assessment of whether the contrastive encoder reliably supports detection.

Authors: We agree that the abstract would be more informative with key quantitative support for the performance claims. In the revised manuscript we will add concise numerical highlights drawn from the experiments (e.g., detection accuracy, FPR, and relative gains versus baselines) while preserving the abstract's brevity. This change will allow readers to evaluate the central claim without first consulting the full experimental section.

Revision: yes

Referee: [Method] Method section (encoder description): The pose-conditioned large-margin contrastive encoder is asserted to isolate persistent identity while cancelling transient pose/expression, but no explicit invariance verification (e.g., embedding variance under controlled pose changes for fixed identity) or ablation on the conditioning is provided. Without this, detected differences in cosine tests could arise from pose mismatch rather than identity swap, undermining the defense.

Authors: We acknowledge the value of explicit verification. The current manuscript describes the conditioning mechanism and loss but does not include a dedicated invariance study or conditioning ablation. We will add both: (1) quantitative embedding variance measurements for fixed identities across controlled pose/expression variations, and (2) an ablation removing the pose-conditioning input. These additions will directly demonstrate that identity separation is not driven by residual pose mismatch.

Revision: yes

Referee: [Experiments] Experiments section: The manuscript reports outperformance and real-time operation but omits specifics on training details, loss formulation, dataset splits, or numerical results (e.g., detection rates, false positives). This leaves the soundness of the biometric leakage exploitation unverified.

Authors: We agree that the experimental section requires greater specificity for reproducibility and verification. The revision will expand this section to include the precise training hyperparameters, the full contrastive loss formulation, dataset split details, and complete numerical results (detection rates, false-positive rates, latency measurements, and per-model comparisons). We will also add explicit controls and metrics for the out-of-distribution generalization experiments.

Revision: yes
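
For reference, the detection rate and false-positive rate named in these responses can be computed from similarity scores as in the sketch below. The function name, the score lists, and the convention that a call is flagged when its similarity falls below the threshold are assumptions for illustration, not details confirmed by the manuscript.

```python
def detection_and_false_positive_rate(genuine_scores, puppeteered_scores, threshold):
    """Return (detection rate, false-positive rate) at a given threshold.

    Scores are cosine similarities; a sample is flagged as puppeteered
    when its score is strictly below the threshold.
    """
    if not genuine_scores or not puppeteered_scores:
        raise ValueError("both score lists must be non-empty")
    detected = sum(1 for s in puppeteered_scores if s < threshold)
    false_alarms = sum(1 for s in genuine_scores if s < threshold)
    return detected / len(puppeteered_scores), false_alarms / len(genuine_scores)
```

Sweeping the threshold over the observed score range yields the ROC curve from which a single operating point can be reported.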

Circularity Check

0 steps flagged

No significant circularity; central defense relies on trained encoder from empirical observation

The paper's derivation begins from the stated observation that the pose-expression latent contains biometric identity information and proceeds by introducing a new pose-conditioned large-margin contrastive encoder trained to isolate persistent cues while cancelling transients. This is not an algebraic reduction, fitted-input prediction, or self-citation chain that collapses back to the inputs by construction; the encoder is presented as an independently trained component whose parameters are learned rather than predefined from the target result. No equations or prior-author uniqueness theorems are invoked in the provided description to force the outcome, and the method is self-contained as a machine-learned detector evaluated on multiple models.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Information is limited to the abstract; no explicit free parameters, invented physical entities, or additional axioms beyond the core domain assumption are stated. The contrastive encoder itself is a methodological contribution rather than a new postulated entity.

axioms (1)

Domain assumption: The pose-expression latent inherently contains biometric information of the driving identity that persists independently of pose and expression. Directly stated in the abstract as the key observation enabling the defense.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

Paper passage: pose-conditioned, large-margin contrastive encoder that isolates persistent identity cues inside the transmitted latent while cancelling transient pose and expression

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

72 extracted references · 72 canonical work pages

[1]

Synthesizing realistic avatars for enhanced virtual communication.IEEE Transactions on Visualization and Computer Graphics, 29(5):1802–1815, 2023

Hao Wang and Li Zhang. Synthesizing realistic avatars for enhanced virtual communication.IEEE Transactions on Visualization and Computer Graphics, 29(5):1802–1815, 2023. doi: 10.1109/TVCG.2023. 3214567. URLhttps://ieeexplore.ieee.org/document/3214567

work page doi:10.1109/tvcg.2023 2023
[2]

Generative adversarial networks for hyper-realistic avatar creation

Min-Jun Lee and Soo-Young Kim. Generative adversarial networks for hyper-realistic avatar creation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 1234–1243, 2022. doi: 10.1109/CVPR.2022.001234. URL https://ieeexplore.ieee.org/ document/001234

work page doi:10.1109/cvpr.2022.001234 2022
[3]

Deep learning techniques for avatar-based interaction in virtual environments

John Smith and Jane Doe. Deep learning techniques for avatar-based interaction in virtual environments. IEEE Transactions on Neural Networks and Learning Systems, 32(12):5600–5612, 2021. doi: 10.1109/ TNNLS.2021.3071234. URLhttps://ieeexplore.ieee.org/document/3071234

work page arXiv 2021
[4]

Ai-mediated 3d video conferencing

Michael Stengel, Koki Nagano, Chao Liu, Matthew Chan, Alex Trevithick, Shalini De Mello, Jonghyun Kim, and David Luebke. Ai-mediated 3d video conferencing. InACM SIGGRAPH Emerging Technologies,

work page
[5]

doi: 10.1145/3588037.3595385

work page doi:10.1145/3588037.3595385
[6]

Shwetha Rajaram, Nels Numan, Balasaravanan Thoravi Kumaravel, Nicolai Marquardt, and Andrew D. Wilson. Blendscape: Enabling unified and personalized video-conferencing environments through genera- tive ai.arXiv preprint arXiv:2403.13947, 2024

work page arXiv 2024
[7]

Faivconf: Face enhancement for ai-based video conference with low bit-rate.arXiv preprint arXiv:2207.04090, 2022

Zhengang Li, Sheng Lin, Shan Liu, Songnan Li, Xue Lin, Wei Wang, and Wei Jiang. Faivconf: Face enhancement for ai-based video conference with low bit-rate.arXiv preprint arXiv:2207.04090, 2022

work page arXiv 2022
[8]

Multimodal active speaker detection and virtual cinematography for video conferencing.arXiv preprint arXiv:2002.03977, 2020

Ross Cutler, Ramin Mehran, Sam Johnson, Cha Zhang, Adam Kirk, Oliver Whyte, and Adarsh Kowdle. Multimodal active speaker detection and virtual cinematography for video conferencing.arXiv preprint arXiv:2002.03977, 2020. 10

work page arXiv 2002
[9]

In: CVPR

Ting-Chun Wang, Arun Mallya, and Ming-Yu Liu. One-shot free-view neural talking-head synthesis for video conferencing. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10039–10049, 2021. doi: 10.1109/CVPR46437.2021.00991

work page doi:10.1109/cvpr46437.2021.00991 2021
[10]

Multimodal semantic communication for generative audio-driven video conferencing.arXiv preprint arXiv:2410.22112, 2024

Haonan Tong, Haopeng Li, Hongyang Du, Zhaohui Yang, Changchuan Yin, and Dusit Niyato. Multimodal semantic communication for generative audio-driven video conferencing.arXiv preprint arXiv:2410.22112, 2024

work page arXiv 2024
[11]

Defending low-bandwidth talking head videoconferencing systems from real-time puppeteering attacks

Danial Samadi Vahdati, Tai Duc Nguyen, and Matthew C Stamm. Defending low-bandwidth talking head videoconferencing systems from real-time puppeteering attacks. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 983–992, 2023

work page 2023
[12]

Avatar fingerprinting for authorized use of synthetic talking-head videos

Ekta Prashnani, Koki Nagano, Shalini De Mello, David Luebke, and Orazio Gallo. Avatar fingerprinting for authorized use of synthetic talking-head videos. InEuropean Conference on Computer Vision, pages 209–228. Springer, 2024

work page 2024
[13]

Combining efficientnet and vision transformers for video deepfake detection

Davide Alessandro Coccomini, Nicola Messina, Claudio Gennaro, and Fabrizio Falchi. Combining efficientnet and vision transformers for video deepfake detection. InInternational conference on image analysis and processing, pages 219–229. Springer, 2022

work page 2022
[14]

Video face manipulation detection through ensemble of cnns

Nicolo Bonettini, Edoardo Daniele Cannas, Sara Mandelli, Luca Bondi, Paolo Bestagini, and Stefano Tubaro. Video face manipulation detection through ensemble of cnns. In2020 25th international conference on pattern recognition (ICPR), pages 5012–5019. IEEE, 2021

work page 2021
[15]

Tall: Thumbnail layout for deepfake video detection

Yuting Xu, Jian Liang, Gengyun Jia, Ziming Yang, Yanhao Zhang, and Ran He. Tall: Thumbnail layout for deepfake video detection. InProceedings of the IEEE/CVF international conference on computer vision, pages 22658–22668, 2023

work page 2023
[16]

Supervised contrastive learning for generalizable and explainable deepfakes detection

Ying Xu, Kiran Raja, and Marius Pedersen. Supervised contrastive learning for generalizable and explainable deepfakes detection. InProceedings of the IEEE/CVF winter conference on applications of computer vision, pages 379–389, 2022

work page 2022
[17]

Laa-net: Localized artifact attention network for quality-agnostic and generalizable deepfake detection

Dat Nguyen, Nesryne Mejri, Inder Pal Singh, Polina Kuleshova, Marcella Astrid, Anis Kacem, Enjie Ghorbel, and Djamila Aouada. Laa-net: Localized artifact attention network for quality-agnostic and generalizable deepfake detection. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 17395–17405, 2024

work page 2024
[18]

Videofact: detecting video forgeries using attention, scene context, and forensic traces

Tai D Nguyen, Shengbang Fang, and Matthew C Stamm. Videofact: detecting video forgeries using attention, scene context, and forensic traces. InProceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 8563–8573, 2024

work page 2024
[19]

Nguyen, Aref Azizpour, and Matthew C

Danial Samadi Vahdati, Tai D. Nguyen, Aref Azizpour, and Matthew C. Stamm. Beyond deepfake images: Detecting ai-generated videos.arXiv preprint arXiv:2404.15955, 2024

work page arXiv 2024
[20]

Distinguish any fake videos: Unleashing the power of large-scale data and motion features.arXiv preprint arXiv:2405.15343, 2024

Lichuan Ji, Yingqi Lin, Zhenhua Huang, Yan Han, Xiaogang Xu, Jiafei Wu, Chong Wang, and Zhe Liu. Distinguish any fake videos: Unleashing the power of large-scale data and motion features.arXiv preprint arXiv:2405.15343, 2024

work page arXiv 2024
[21]

What matters in detecting ai-generated videos like sora?

Chirui Chang, Zhengzhe Liu, Xiaoyang Lyu, and Xiaojuan Qi. What matters in detecting ai-generated videos like sora?arXiv preprint arXiv:2406.19568, 2024

work page arXiv 2024
[22]

Towards a universal synthetic video detector: From face or background manipulations to fully ai-generated content.arXiv preprint arXiv:2412.12278, 2024

Yuezun Li, Ming-Ching Chang, and Siwei Lyu. Towards a universal synthetic video detector: From face or background manipulations to fully ai-generated content.arXiv preprint arXiv:2412.12278, 2024

work page arXiv 2024
[23]

Generalizable and animatable gaussian head avatar.Advances in Neural Information Processing Systems, 37:57642–57670, 2024

Xuangeng Chu and Tatsuya Harada. Generalizable and animatable gaussian head avatar.Advances in Neural Information Processing Systems, 37:57642–57670, 2024

work page 2024
[24]

Emoca: Emotion driven monocular face capture and animation

Radek Danˇeˇcek, Michael J Black, and Timo Bolkart. Emoca: Emotion driven monocular face capture and animation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20311–20322, 2022

work page 2022
[25]

Audio2head: Audio-driven one-shot talking-head generation with natural head motion

Suzhen Wang, Lincheng Li, Yu Ding, Changjie Fan, and Xin Yu. Audio2head: Audio-driven one-shot talking-head generation with natural head motion. InProceedings of the 30th International Joint Conference on Artificial Intelligence (IJCAI), pages 1107–1113, 2021. doi: 10.24963/ijcai.2021/152

work page doi:10.24963/ijcai.2021/152 2021
[26]

Emotalker: Audio driven emotion aware talking head generation

Xiaoqian Shen, Yantong Wang, Zhenhua Liu, and Zhiyong Wang. Emotalker: Audio driven emotion aware talking head generation. InProceedings of the 17th Asian Conference on Computer Vision (ACCV), pages 123–137, 2024. doi: 10.1007/978-3-031-68418-0_9. 11

work page doi:10.1007/978-3-031-68418-0_9 2024
[27]

Expressive talking head video encoding in stylegan2 latent space.arXiv preprint arXiv:2203.14512, 2022

Trevine Oorloff and Yaser Yacoob. Expressive talking head video encoding in stylegan2 latent space.arXiv preprint arXiv:2203.14512, 2022

work page arXiv 2022
[28]

Talking-head generation with rhythmic head motion

Lele Chen, Guofeng Cui, Celong Liu, Zhong Li, Ziyi Kou, Yi Xu, and Chenliang Xu. Talking-head generation with rhythmic head motion. InProceedings of the European Conference on Computer Vision (ECCV), pages 35–51, 2020

work page 2020
[29]

Namboodiri, and C.V

Madhav Agarwal, Anchit Gupta, Rudrabha Mukhopadhyay, Vinay P. Namboodiri, and C.V . Jawahar. Compressing video calls using synthetic talking heads. InProceedings of the British Machine Vision Conference (BMVC), pages 1–12, 2021

work page 2021
[30]

Depth-aware generative adversarial network for talking head video generation

Fa-Ting Hong, Longhao Zhang, Li Shen, and Dan Xu. Depth-aware generative adversarial network for talking head video generation. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 3397–3406, 2022

work page 2022
[31]

Dagan++: Depth-aware generative adversarial network for talking head video generation.IEEE Transactions on Pattern Analysis and Machine Intelligence, 46(5):2997–3012, 2023

Fa-Ting Hong, Li Shen, and Dan Xu. Dagan++: Depth-aware generative adversarial network for talking head video generation.IEEE Transactions on Pattern Analysis and Machine Intelligence, 46(5):2997–3012, 2023

work page 2023
[32]

Implicit warping for animation with image sets

Arun Mallya, Ting-Chun Wang, and Ming-Yu Liu. Implicit warping for animation with image sets. Advances in Neural Information Processing Systems, 35:22438–22450, 2022

work page 2022
[33]

Anifacegan: Animatable 3d-aware face image generation for video avatars.Advances in Neural Information Processing Systems, 35:36188–36201, 2022

Yue Wu, Yu Deng, Jiaolong Yang, Fangyun Wei, Qifeng Chen, and Xin Tong. Anifacegan: Animatable 3d-aware face image generation for video avatars.Advances in Neural Information Processing Systems, 35:36188–36201, 2022

work page 2022
[34]

Talking face generation with multilingual tts

Hyoung-Kyu Song, Sang Hoon Woo, Junhyeok Lee, Seungmin Yang, Hyunjae Cho, Youseong Lee, Dongho Choi, and Kang-wook Kim. Talking face generation with multilingual tts. InProceedings of the ieee/cvf conference on computer vision and pattern recognition, pages 21425–21430, 2022

work page 2022
[35]

High-fidelity and freely controllable talking head video generation

Yue Gao, Yuan Zhou, Jinglu Wang, Xiao Li, Xiang Ming, and Yan Lu. High-fidelity and freely controllable talking head video generation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10039–10049, 2023. doi: 10.1109/CVPR2023.00991

work page doi:10.1109/cvpr2023.00991 2023
[36]

Discohead: Audio-and-video-driven talking head generation by disentangled control of head pose and facial expressions

Geumbyeol Hwang, Sunwon Hong, Seunghyun Lee, Sungwoo Park, and Gyeongsu Chae. Discohead: Audio-and-video-driven talking head generation by disentangled control of head pose and facial expressions. arXiv preprint arXiv:2303.07697, 2023

work page arXiv 2023
[37]

Emotivetalk: Expressive talking head generation through audio information decoupling and emotional video diffusion.arXiv preprint arXiv:2411.16726, 2024

Haotian Wang, Yuzhe Weng, Yueyan Li, Zilu Guo, Jun Du, Shutong Niu, Jiefeng Ma, Shan He, Xiaoyan Wu, Qiming Hu, Bing Yin, Cong Liu, and Qingfeng Liu. Emotivetalk: Expressive talking head generation through audio information decoupling and emotional video diffusion.arXiv preprint arXiv:2411.16726, 2024

work page arXiv 2024
[38]

Interactive conversational head generation.arXiv preprint arXiv:2307.02090, 2023

Haotian Wang, Yuzhe Weng, Yueyan Li, Zilu Guo, Jun Du, Shutong Niu, Jiefeng Ma, Shan He, Xiaoyan Wu, Qiming Hu, Bing Yin, Cong Liu, and Qingfeng Liu. Interactive conversational head generation.arXiv preprint arXiv:2307.02090, 2023

work page arXiv 2023
[39]

R2-talker: Realistic real-time talking head synthesis with hash grid landmarks encoding and progressive multilayer conditioning.arXiv preprint arXiv:2312.05572, 2023

Zhiling Ye, LiangGuo Zhang, Dingheng Zeng, Quan Lu, and Ning Jiang. R2-talker: Realistic real-time talking head synthesis with hash grid landmarks encoding and progressive multilayer conditioning.arXiv preprint arXiv:2312.05572, 2023

work page arXiv 2023
[40]

Efficient conditioned face animation using frontally-viewed embedding.arXiv preprint arXiv:2203.08765, 2022

Maxime Oquab, Daniel Haziza, Ludovic Schwartz, Tao Xu, Katayoun Zand, Rui Wang, Peirong Liu, and Camille Couprie. Efficient conditioned face animation using frontally-viewed embedding.arXiv preprint arXiv:2203.08765, 2022

work page arXiv 2022
[41]

arXiv preprint arXiv:2301.12345 , year =

John Doe, Jane Smith, and Alan Turing. Txt2vid: Ultra-low bitrate compression of talking-head videos via text.arXiv preprint arXiv:2301.12345, 2023

work page arXiv 2023
[42]

Goldman, Supreeth Achar, Gregory Major Blascovich, Joseph G

Jason Lawrence, Dan B. Goldman, Supreeth Achar, Gregory Major Blascovich, Joseph G. Desloge, Tommy Fortes, Eric M. Gomez, Sascha Häberling, Hugues Hoppe, Andy Huibers, Claude Knaus, Brian Kuschak, Ricardo Martin-Brualla, Harris Nover, Andrew Ian Russell, Steven M. Seitz, and Kevin Tong. Project starline: A high-fidelity telepresence system.ACM Transaction...

work page doi:10.1145/3478513.3480490 2021
[43]

Fakeout: Leveraging out-of-domain self-supervision for multi-modal video deepfake detection.arXiv preprint arXiv:2212.00773, 2022

Gil Knafo and Ohad Fried. Fakeout: Leveraging out-of-domain self-supervision for multi-modal video deepfake detection.arXiv preprint arXiv:2212.00773, 2022. 12

work page arXiv 2022
[44]

Emerging properties in self-supervised vision transformers

Shu Hu, Yuezun Li, and Siwei Lyu. Exposing gan-generated faces using inconsistent corneal specular highlights. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 1506–1515, 2021. doi: 10.1109/ICCV48922.2021.00154

work page doi:10.1109/iccv48922.2021.00154 2021
[45]

Detecting deep-fake videos from phoneme- viseme mismatches

Yuezun Li, Xin Yang, Pu Sun, Honggang Qi, and Siwei Lyu. Detecting deep-fake videos from phoneme- viseme mismatches. InProceedings of the 27th ACM International Conference on Multimedia (MM), pages 1136–1145, 2018. doi: 10.1145/3343031.3350928

work page doi:10.1145/3343031.3350928 2018
[46]

Deeprhythm: Exposing deepfakes with attentional visual heartbeat rhythms

Yuezun Li, Ming-Ching Chang, and Siwei Lyu. Deeprhythm: Exposing deepfakes with attentional visual heartbeat rhythms. InProceedings of the 28th ACM International Conference on Multimedia (MM), pages 1411–1419, 2020. doi: 10.1145/3394171.3413651

work page doi:10.1145/3394171.3413651 2020
[47]

Ilke Demir, Umur Aybars Ciftci, and Lijun Yin. Fakecatcher: Detection of synthetic portrait videos using biological signals. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 1432–1441, 2021. doi: 10.1109/ICCV48922.2021.00147
[48]

Wei-Ming Zhang, Xue-Jie Zhang, and Yu-Feng Li. Capturing the lighting inconsistency for deepfake detection. In Proceedings of the European Conference on Computer Vision (ECCV), pages 812–828, 2022. doi: 10.1007/978-3-031-06788-4_52
[49]

Wei-Ming Zhang, Xue-Jie Zhang, and Yu-Feng Li. Illumination enlightened spatial-temporal inconsistency for deepfake video detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 12345–12354, 2023. doi: 10.1109/CVPR2023.01234
[50]

Wei-Ming Zhang, Xue-Jie Zhang, and Yu-Feng Li. Lideepdet: Deepfake detection via image decomposition and inconsistency analysis. Electronics, 13(22):4466, 2023. doi: 10.3390/electronics13224466
[51]

Andrea Ciamarra, Roberto Caldelli, Federico Becattini, Lorenzo Seidenari, and Alberto Del Bimbo. Deepfake detection by exploiting surface anomalies: The surfake approach. arXiv preprint arXiv:2310.20621, 2023
[52]

Zongmei Chen, Xin Liao, and Xiaoshuai Wu. Aim-bone: Texture discrepancy generation and localization for generalized deepfake detection. IEEE Transactions on Multimedia, 25:1–13, 2023. doi: 10.1109/TMM.2023.3245678
[53]

Zhiwei Xiong, Wei Wang, and Xiaochun Cao. Deepfake detection and localization using multi-view inconsistency measurement. IEEE Transactions on Multimedia, 25:1–12, 2023. doi: 10.1109/TMM.2023.3234567
[54]

Haodong Li, Bin Li, and Shunquan Tan. Artifacts-disentangled adversarial learning for deepfake detection. IEEE Transactions on Information Forensics and Security, 17:1–14, 2022. doi: 10.1109/TIFS.2022.3142356
[55]

Steven R Livingstone and Frank A Russo. The ryerson audio-visual database of emotional speech and song (ravdess): A dynamic, multimodal set of facial and vocal expressions in north american english. PloS one, 13(5):e0196391, 2018
[56]

Houwei Cao, David G Cooper, Michael K Keutmann, Ruben C Gur, Ani Nenkova, and Ragini Verma. Crema-d: Crowd-sourced emotional multimodal actors dataset. IEEE transactions on affective computing, 5(4):377–390, 2014
[57]

Shuling Zhao, Fa-Ting Hong, Xiaoshui Huang, and Dan Xu. Synergizing motion and appearance: Multi-scale compensatory codebooks for talking head video generation. arXiv preprint arXiv:2412.00719, 2024
[58]

Shuai Tan, Bin Ji, Yu Ding, and Ye Pan. Say anything with any style. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 5088–5096, 2024
[59]

Haoyu Ma, Tong Zhang, Shanlin Sun, Xiangyi Yan, Kun Han, and Xiaohui Xie. Cvthead: One-shot controllable head avatar with vertex-feature transformer. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 6131–6141, 2024
[60]

Shuai Shen, Wanhua Li, Zheng Zhu, Yueqi Duan, Jie Zhou, and Jiwen Lu. Learning dynamic facial radiance fields for few-shot talking head synthesis. In European conference on computer vision, pages 666–682. Springer, 2022
[61]

Wenxuan Zhang, Xiaodong Cun, Xuan Wang, Yong Zhang, Xi Shen, Yu Guo, Ying Shan, and Fei Wang. Sadtalker: Learning realistic 3d motion coefficients for stylized audio-driven single image talking face animation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8652–8661, 2023.
[62]

Yaohui Wang, Di Yang, Francois Bremond, and Antitza Dantcheva. Latent image animator: Learning to animate images via latent space navigation. arXiv preprint arXiv:2203.09043, 2022
[63]

Madhav Agarwal, Rudrabha Mukhopadhyay, Vinay P Namboodiri, and CV Jawahar. Audio-visual face reenactment. In Proceedings of the IEEE/CVF winter conference on applications of computer vision, pages 5178–5187, 2023
[64]

Hao Wang, Yitong Wang, Zheng Zhou, Xing Ji, Dihong Gong, Jingchao Zhou, Zhifeng Li, and Wei Liu. Cosface: Large margin cosine loss for deep face recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5265–5274, 2018
[65]

Minchul Kim, Anil K Jain, and Xiaoming Liu. Adaface: Quality adaptive margin for face recognition. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 18750–18759, 2022
[66]

Yunwen Lei, Tianbao Yang, Yiming Ying, and Ding-Xuan Zhou. Generalization analysis for contrastive representation learning. In International Conference on Machine Learning, pages 19200–19227. PMLR, 2023
[67]

Daniel Rho, TaeSoo Kim, Sooill Park, Jaehyun Park, and JaeHan Park. Understanding contrastive learning through the lens of margins. arXiv preprint arXiv:2306.11526, 2023
[68]

Junshu Tang, Bo Zhang, Binxin Yang, Ting Zhang, Dong Chen, Lizhuang Ma, and Fang Wen. 3dfaceshop: Explicitly controllable 3d-aware portrait generation. IEEE transactions on visualization and computer graphics, 30(9):6020–6037, 2023
[69]

Fa-Ting Hong and Dan Xu. Implicit identity representation conditioned memory compensation network for talking head video generation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 23062–23072, 2023
[70]

Nikita Drobyshev, Antoni Bigata Casademunt, Konstantinos Vougioukas, Zoe Landgraf, Stavros Petridis, and Maja Pantic. Emoportraits: Emotion-enhanced multimodal one-shot head avatars. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8498–8507, 2024
[71]

Stella Bounareli, Vasileios Argyriou, and Georgios Tzimiropoulos. Finding directions in gan’s latent space for neural face reenactment. arXiv preprint arXiv:2202.00046, 2022
[72]

Jianzhu Guo, Dingyun Zhang, Xiaoqiang Liu, Zhizhou Zhong, Yuan Zhang, Pengfei Wan, and Di Zhang. Liveportrait: Efficient portrait animation with stitching and retargeting control. arXiv preprint arXiv:2407.03168, 2024.

Unmasking Puppeteers: Leveraging Biometric Leakage to Disarm Impersonation in AI-based Videoconferencing
Supplementary Material
A. Addit...

[41] [41]

arXiv preprint arXiv:2301.12345 , year =

John Doe, Jane Smith, and Alan Turing. Txt2vid: Ultra-low bitrate compression of talking-head videos via text.arXiv preprint arXiv:2301.12345, 2023

work page arXiv 2023

[42] [42]

Goldman, Supreeth Achar, Gregory Major Blascovich, Joseph G

Jason Lawrence, Dan B. Goldman, Supreeth Achar, Gregory Major Blascovich, Joseph G. Desloge, Tommy Fortes, Eric M. Gomez, Sascha Häberling, Hugues Hoppe, Andy Huibers, Claude Knaus, Brian Kuschak, Ricardo Martin-Brualla, Harris Nover, Andrew Ian Russell, Steven M. Seitz, and Kevin Tong. Project starline: A high-fidelity telepresence system.ACM Transaction...

work page doi:10.1145/3478513.3480490 2021

[43] [43]

Fakeout: Leveraging out-of-domain self-supervision for multi-modal video deepfake detection.arXiv preprint arXiv:2212.00773, 2022

Gil Knafo and Ohad Fried. Fakeout: Leveraging out-of-domain self-supervision for multi-modal video deepfake detection.arXiv preprint arXiv:2212.00773, 2022. 12

work page arXiv 2022

[44] [44]

Emerging properties in self-supervised vision transformers

Shu Hu, Yuezun Li, and Siwei Lyu. Exposing gan-generated faces using inconsistent corneal specular highlights. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 1506–1515, 2021. doi: 10.1109/ICCV48922.2021.00154

work page doi:10.1109/iccv48922.2021.00154 2021

[45] [45]

Detecting deep-fake videos from phoneme- viseme mismatches

Yuezun Li, Xin Yang, Pu Sun, Honggang Qi, and Siwei Lyu. Detecting deep-fake videos from phoneme- viseme mismatches. InProceedings of the 27th ACM International Conference on Multimedia (MM), pages 1136–1145, 2018. doi: 10.1145/3343031.3350928

work page doi:10.1145/3343031.3350928 2018

[46] [46]

Deeprhythm: Exposing deepfakes with attentional visual heartbeat rhythms

Yuezun Li, Ming-Ching Chang, and Siwei Lyu. Deeprhythm: Exposing deepfakes with attentional visual heartbeat rhythms. InProceedings of the 28th ACM International Conference on Multimedia (MM), pages 1411–1419, 2020. doi: 10.1145/3394171.3413651

work page doi:10.1145/3394171.3413651 2020

[47] [47]

Emerging properties in self-supervised vision transformers

Ilke Demir, Umur Aybars Ciftci, and Lijun Yin. Fakecatcher: Detection of synthetic portrait videos using biological signals. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 1432–1441, 2021. doi: 10.1109/ICCV48922.2021.00147

work page doi:10.1109/iccv48922.2021.00147 2021

[48] [48]

Capturing the lighting inconsistency for deepfake detection

Wei-Ming Zhang, Xue-Jie Zhang, and Yu-Feng Li. Capturing the lighting inconsistency for deepfake detection. InProceedings of the European Conference on Computer Vision (ECCV), pages 812–828, 2022. doi: 10.1007/978-3-031-06788-4_52

work page doi:10.1007/978-3-031-06788-4_52 2022

[49] [49]

Illumination enlightened spatial-temporal inconsistency for deepfake video detection

Wei-Ming Zhang, Xue-Jie Zhang, and Yu-Feng Li. Illumination enlightened spatial-temporal inconsistency for deepfake video detection. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 12345–12354, 2023. doi: 10.1109/CVPR2023.01234

work page doi:10.1109/cvpr2023.01234 2023

[50] [50]

Lideepdet: Deepfake detection via image decomposition and inconsistency analysis.Electronics, 13(22):4466, 2023

Wei-Ming Zhang, Xue-Jie Zhang, and Yu-Feng Li. Lideepdet: Deepfake detection via image decomposition and inconsistency analysis.Electronics, 13(22):4466, 2023. doi: 10.3390/electronics13224466

work page doi:10.3390/electronics13224466 2023

[51] [51]

Deep- fake detection by exploiting surface anomalies: The surfake approach.arXiv preprint arXiv:2310.20621, 2023

Andrea Ciamarra, Roberto Caldelli, Federico Becattini, Lorenzo Seidenari, and Alberto Del Bimbo. Deep- fake detection by exploiting surface anomalies: The surfake approach.arXiv preprint arXiv:2310.20621, 2023

work page arXiv 2023

[52] [52]

IEEE Transactions on Multimedia 25, 942– 952 (2020) https://doi.org/10.1109/tmm

Zongmei Chen, Xin Liao, and Xiaoshuai Wu. Aim-bone: Texture discrepancy generation and localization for generalized deepfake detection.IEEE Transactions on Multimedia, 25:1–13, 2023. doi: 10.1109/TMM. 2023.3245678

work page doi:10.1109/tmm 2023

[53] [53]

Deepfake detection and localization using multi-view inconsistency measurement.IEEE Transactions on Multimedia, 25:1–12, 2023

Zhiwei Xiong, Wei Wang, and Xiaochun Cao. Deepfake detection and localization using multi-view inconsistency measurement.IEEE Transactions on Multimedia, 25:1–12, 2023. doi: 10.1109/TMM.2023. 3234567

work page doi:10.1109/tmm.2023 2023

[54] [54]

Artifacts-disentangled adversarial learning for deepfake detection

Haodong Li, Bin Li, and Shunquan Tan. Artifacts-disentangled adversarial learning for deepfake detection. IEEE Transactions on Information Forensics and Security, 17:1–14, 2022. doi: 10.1109/TIFS.2022. 3142356

work page doi:10.1109/tifs.2022 2022

[55] [55]

Steven R Livingstone and Frank A Russo. The ryerson audio-visual database of emotional speech and song (ravdess): A dynamic, multimodal set of facial and vocal expressions in north american english.PloS one, 13(5):e0196391, 2018

work page 2018

[56] [56]

Crema-d: Crowd-sourced emotional multimodal actors dataset.IEEE transactions on affective computing, 5(4):377–390, 2014

Houwei Cao, David G Cooper, Michael K Keutmann, Ruben C Gur, Ani Nenkova, and Ragini Verma. Crema-d: Crowd-sourced emotional multimodal actors dataset.IEEE transactions on affective computing, 5(4):377–390, 2014

work page 2014

[57] [57]

Synergizing motion and appearance: Multi- scale compensatory codebooks for talking head video generation.arXiv preprint arXiv:2412.00719, 2024

Shuling Zhao, Fa-Ting Hong, Xiaoshui Huang, and Dan Xu. Synergizing motion and appearance: Multi- scale compensatory codebooks for talking head video generation.arXiv preprint arXiv:2412.00719, 2024

work page arXiv 2024

[58] [58]

Say anything with any style

Shuai Tan, Bin Ji, Yu Ding, and Ye Pan. Say anything with any style. InProceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 5088–5096, 2024

work page 2024

[59] [59]

Cvthead: One-shot con- trollable head avatar with vertex-feature transformer

Haoyu Ma, Tong Zhang, Shanlin Sun, Xiangyi Yan, Kun Han, and Xiaohui Xie. Cvthead: One-shot con- trollable head avatar with vertex-feature transformer. InProceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 6131–6141, 2024

work page 2024

[60] [60]

Learning dynamic facial radiance fields for few-shot talking head synthesis

Shuai Shen, Wanhua Li, Zheng Zhu, Yueqi Duan, Jie Zhou, and Jiwen Lu. Learning dynamic facial radiance fields for few-shot talking head synthesis. InEuropean conference on computer vision, pages 666–682. Springer, 2022

work page 2022

[61] [61]

Sadtalker: Learning realistic 3d motion coefficients for stylized audio-driven single image talking face animation

Wenxuan Zhang, Xiaodong Cun, Xuan Wang, Yong Zhang, Xi Shen, Yu Guo, Ying Shan, and Fei Wang. Sadtalker: Learning realistic 3d motion coefficients for stylized audio-driven single image talking face animation. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8652–8661, 2023. 13

work page 2023

[62] [62]

Latent image animator: Learning to animate images via latent space navigation.arXiv preprint arXiv:2203.09043, 2022

Yaohui Wang, Di Yang, Francois Bremond, and Antitza Dantcheva. Latent image animator: Learning to animate images via latent space navigation.arXiv preprint arXiv:2203.09043, 2022

work page arXiv 2022

[63] [63]

Audio-visual face reenactment

Madhav Agarwal, Rudrabha Mukhopadhyay, Vinay P Namboodiri, and CV Jawahar. Audio-visual face reenactment. InProceedings of the IEEE/CVF winter conference on applications of computer vision, pages 5178–5187, 2023

work page 2023

[64] [64]

Cosface: Large margin cosine loss for deep face recognition

Hao Wang, Yitong Wang, Zheng Zhou, Xing Ji, Dihong Gong, Jingchao Zhou, Zhifeng Li, and Wei Liu. Cosface: Large margin cosine loss for deep face recognition. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 5265–5274, 2018

work page 2018

[65] [65]

Adaface: Quality adaptive margin for face recognition

Minchul Kim, Anil K Jain, and Xiaoming Liu. Adaface: Quality adaptive margin for face recognition. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 18750–18759, 2022

work page 2022

[66] [66]

Generalization analysis for contrastive representation learning

Yunwen Lei, Tianbao Yang, Yiming Ying, and Ding-Xuan Zhou. Generalization analysis for contrastive representation learning. InInternational Conference on Machine Learning, pages 19200–19227. PMLR, 2023

work page 2023

[67] Daniel Rho, TaeSoo Kim, Sooill Park, Jaehyun Park, and JaeHan Park. Understanding contrastive learning through the lens of margins. arXiv preprint arXiv:2306.11526, 2023.

[68] Junshu Tang, Bo Zhang, Binxin Yang, Ting Zhang, Dong Chen, Lizhuang Ma, and Fang Wen. 3dfaceshop: Explicitly controllable 3d-aware portrait generation. IEEE transactions on visualization and computer graphics, 30(9):6020–6037, 2023.

[69] Fa-Ting Hong and Dan Xu. Implicit identity representation conditioned memory compensation network for talking head video generation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 23062–23072, 2023.

[70] Nikita Drobyshev, Antoni Bigata Casademunt, Konstantinos Vougioukas, Zoe Landgraf, Stavros Petridis, and Maja Pantic. Emoportraits: Emotion-enhanced multimodal one-shot head avatars. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8498–8507, 2024.

[71] Stella Bounareli, Vasileios Argyriou, and Georgios Tzimiropoulos. Finding directions in gan’s latent space for neural face reenactment. arXiv preprint arXiv:2202.00046, 2022.

[72] Jianzhu Guo, Dingyun Zhang, Xiaoqiang Liu, Zhizhou Zhong, Yuan Zhang, Pengfei Wan, and Di Zhang. Liveportrait: Efficient portrait animation with stitching and retargeting control. arXiv preprint arXiv:2407.03168, 2024.

Unmasking Puppeteers: Leveraging Biometric Leakage to Disarm Impersonation in AI-based Videoconferencing

Supplementary Material

A. Addit...