The DeepSpeak Dataset

Hany Farid; Maty Bohacek; Sarah Barrington

arxiv: 2408.05366 · v5 · submitted 2024-08-09 · 💻 cs.CV

Pith reviewed 2026-05-23 21:44 UTC · model grok-4.3

classification 💻 cs.CV

keywords: deepfake detection · deepfake dataset · talking heads · video synthesis · voice cloning · identity matching · multimodal content · adversarial attacks

The pith

Current deepfake detectors fail to generalize to a new dataset of high-quality talking-head fakes without retraining.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces the DeepSpeak dataset of over 100 hours of real and deepfake audiovisual content focused on talking heads. Real recordings come from 500 consenting participants, while fakes are produced with 14 video synthesis engines and three voice cloning engines through an embedding-based identity-matching process. Large-scale tests show that existing detectors, applied without retraining, lose performance on this material, indicating that training data must incorporate the latest generative tools to address realistic adversarial scenarios such as impostor attacks in video calls.

Core claim

Without retraining, state-of-the-art deepfake detectors fail to generalize to the DeepSpeak dataset, which contains more than 50 hours of self-recorded real data from 500 participants and more than 50 hours of state-of-the-art deepfakes created with 14 video engines, three voice engines, and identity-matched face swaps.

What carries the argument

The DeepSpeak dataset, constructed via embedding-based identity matching to produce realistic audiovisual deepfakes from multiple current synthesis engines.
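
As a rough illustration of what embedding-based identity matching could look like, the sketch below pairs each participant with the most similar other participant by cosine similarity. The embedding values, the participant IDs, and the function names `cosine_similarity` and `match_identities` are hypothetical, and nearest-neighbour matching by cosine similarity is an assumption: this summary does not state which embedding model or matching rule the paper uses.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def match_identities(embeddings):
    """Map each identity to its most similar *other* identity.

    `embeddings` is a dict of identity ID -> embedding vector.
    Returns a dict of identity ID -> (best match ID, similarity).
    """
    matches = {}
    for source_id, source_vec in embeddings.items():
        best_id, best_sim = None, -math.inf
        for target_id, target_vec in embeddings.items():
            if target_id == source_id:
                continue  # never pair an identity with itself
            sim = cosine_similarity(source_vec, target_vec)
            if sim > best_sim:
                best_id, best_sim = target_id, sim
        matches[source_id] = (best_id, best_sim)
    return matches


# Hypothetical 3-dimensional embeddings; real face embeddings are much larger.
embeddings = {
    "p001": [0.9, 0.1, 0.0],
    "p002": [0.8, 0.2, 0.1],
    "p003": [0.0, 0.1, 0.9],
}
matches = match_identities(embeddings)
for source_id, (target_id, sim) in matches.items():
    print(f"{source_id} -> {target_id} (similarity {sim:.3f})")
```

Matching on similarity rather than pairing at random is what would make a swapped face plausible for the target: the two identities already look alike in embedding space.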

If this is right

Detectors require retraining on data from the newest synthesis engines to regain effectiveness against talking-head deepfakes.
Identity-matching protocols produce higher-quality simulated attacks than random pairing methods.
Multimodal audiovisual coverage is required to evaluate detectors intended for video-conference and identity-verification settings.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Video-conference platforms could incorporate periodic retraining on datasets like this one to reduce impostor risks.
Engine-specific failure patterns in the dataset could guide targeted improvements to detection methods.

Load-bearing premise

Deepfakes generated with the 14 video and three voice engines plus identity matching serve as realistic stand-ins for actual adversarial attacks.

What would settle it

Measure the accuracy of the tested detectors on the full DeepSpeak test set and compare it to their reported accuracy on their original training distributions.
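
One way to make that comparison concrete is to score the same detector on both test sets and report the difference in AUC, the metric the simulated rebuttal mentions. The sketch below computes AUC from first principles as the probability that a randomly chosen fake receives a higher score than a randomly chosen real clip. All labels and scores are made-up placeholders, and `roc_auc` is a hypothetical helper, not code from the paper.

```python
def roc_auc(labels, scores):
    """AUC via pairwise comparison: P(score of a fake > score of a real).

    `labels` uses 1 for fake and 0 for real; ties count as half a win.
    """
    fake_scores = [s for label, s in zip(labels, scores) if label == 1]
    real_scores = [s for label, s in zip(labels, scores) if label == 0]
    if not fake_scores or not real_scores:
        raise ValueError("need at least one real and one fake example")
    wins = 0.0
    for f in fake_scores:
        for r in real_scores:
            if f > r:
                wins += 1.0
            elif f == r:
                wins += 0.5
    return wins / (len(fake_scores) * len(real_scores))


# Hypothetical detector outputs (higher score = more likely fake).
original_labels = [1, 1, 1, 0, 0, 0]
original_scores = [0.9, 0.8, 0.7, 0.3, 0.2, 0.1]

new_labels = [1, 1, 1, 0, 0, 0]
new_scores = [0.6, 0.4, 0.2, 0.5, 0.3, 0.1]

auc_original = roc_auc(original_labels, original_scores)
auc_new = roc_auc(new_labels, new_scores)
auc_drop = auc_original - auc_new
print(f"original: {auc_original:.3f}, new: {auc_new:.3f}, drop: {auc_drop:.3f}")
```

AUC is threshold-free, so it avoids the question of which operating point to carry over from the original benchmark. The pairwise loop is quadratic in the number of clips and is meant for illustration only.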

Figures

Figures reproduced from arXiv: 2408.05366 by Hany Farid, Maty Bohacek, Sarah Barrington.

**Figure 1.** An overview of the DeepSpeak Dataset sourced from a diverse selection of consenting participants using a custom-built data collection methodology. The dataset also comprises deepfakes generated from 14 video and three audio deepfake methods using facial identity matching to improve the realism of the generated deepfakes.

**Figure 2.** A t-SNE visualization of CLIP embeddings from real participants’ videos. The four…

**Figure 3.** Overview of DeepSpeak deepfakes: Face-swap deepfakes replace the facial region from a source identity with the face of a target identity while retaining the audio from the target video. Lip-sync deepfakes overlay a generated mouth region onto the original face and synchronize it with a new audio track (real or fake). Avatar deepfakes animate the head and shoulders of an identity based on a single still fra…

**Figure 4.** A screenshot of the custom-built recording tool used for real participant data collection.

**Figure 5.** Video frames sampled from three representative examples of…

**Figure 6.** Video frames sampled from three representative examples of…

**Figure 7.** Video frames sampled from three representative examples of…

**Figure 8.** Video frames sampled from three representative examples of…

**Figure 9.** Video frames sampled from three representative examples of…

**Figure 10.** Video frames sampled from three representative examples of…

**Figure 11.** Video frames sampled from three representative examples of…

**Figure 12.** Video frames sampled from three representative examples of…

**Figure 13.** Video frames sampled from three representative examples of…

**Figure 14.** Video frames sampled from three representative examples of…

**Figure 15.** Video frames sampled from three representative examples of…

**Figure 16.** Video frames sampled from three representative examples of…

**Figure 17.** Video frames sampled from three representative examples of…

**Figure 18.** Video frames sampled from three representative examples of…

**Figure 19.** Additional examples of matched identities from the first release of DeepSpeak data.

**Figure 20.** Additional examples of matched identities from the second release of DeepSpeak data.

**Figure 21.** Screenshot of the introduction and consent page of the data collection study.

**Figure 22.** Screenshot of the overview and recording instructions page of the data collection study.

**Figure 23.** Screenshot of the environment checks page of the data collection study.

**Figure 24.** Screenshot of the recording test page of the data collection study.

**Figure 25.** Screenshot of an example prompt from the data collection study.

**Figure 26.** Representative examples of different action prompt types.

Original abstract

Deepfakes represent a growing concern across domains such as disinformation, fraud, and non-consensual media. In particular, the rise of video conference and identity-driven attacks in high-stakes scenarios--such as impostor hiring--demands new forensic resources. Despite significant efforts to develop robust detection classifiers to distinguish the real from the fake, commonly used training datasets remain inadequate: relying on low-quality and outdated deepfake generators, consisting of content scraped from online repositories without participant consent, lacking in multimodal coverage, and rarely employing identity-matching protocols to ensure realistic fakes. To overcome these limitations, we present the DeepSpeak dataset, a diverse and multimodal dataset comprising over 100 hours of authentic and deepfake audiovisual content, specifically focused on the challenging and diverse talking heads context. We contribute: i) more than 50 hours of real, self-recorded data collected from 500 diverse and consenting participants, ii) more than 50 hours of state-of-the-art audio and visual deepfakes generated using 14 video synthesis engines and three voice cloning engines, and iii) an embedding-based, identity-matching approach to ensure the creation of convincing, high-quality identity face swaps that realistically simulate adversarial deepfake attacks. We also perform large-scale evaluations of state-of-the-art deepfake detectors and show that, without retraining, these detectors fail to generalize to this DeepSpeak dataset, highlighting the importance of a large and diverse dataset containing deepfakes from the latest generative-AI tools.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

DeepSpeak adds a consented, large-scale talking-head dataset with modern generators and identity-matched swaps, but the generalization failure claim risks confounding generator novelty with real-video recording shifts.

The main takeaway is that this is a useful dataset release for deepfake detection in video-conference settings. It supplies over 100 hours of real and synthetic audiovisual content from 500 consenting participants, generated with 14 video engines plus three voice cloners, and uses embedding-based identity matching to create realistic face swaps. That combination of consent, scale, multimodal coverage, and current generators goes beyond the scraped, low-quality, non-consensual sets that dominate the field right now. The paper also runs a benchmark showing that existing detectors drop in performance on this data without retraining, which aligns with the motivation for releasing it. Those elements are concrete and address documented gaps in prior resources.

The evaluation section is thin on specifics. The abstract asserts failure to generalize but gives no metrics, no list of tested models, and no controls that separate the effect of the new generators from differences in how the real videos were recorded. The real clips come from self-recorded webcams with varied lighting and backgrounds, while many detectors were trained on more controlled corpora such as FaceForensics++. Without ablations that hold the real class fixed and vary only the synthesis method, it is hard to attribute the performance drop cleanly to the generators. That is a fixable issue rather than a fatal one, but it weakens the central empirical claim.

This paper is aimed at media-forensics groups that build or test detectors for high-stakes talking-head scenarios. A reader working on dataset construction or on robustness to recent generators will find the release details and the identity-matching protocol worth examining. It is worth sending to peer review because the core contribution is a new resource with clear scope and documented collection choices; the evaluation gaps are addressable in revision and do not undermine the dataset itself.

Referee Report

2 major / 0 minor

Summary. The manuscript introduces the DeepSpeak dataset: over 100 hours of real and deepfake audiovisual talking-head content from 500 consenting participants, generated with 14 video synthesis engines, 3 voice cloning engines, and embedding-based identity matching. It reports large-scale evaluations claiming that state-of-the-art deepfake detectors fail to generalize to this dataset without retraining, underscoring the need for more diverse and modern training resources.

Significance. If the generalization experiments hold after addressing potential confounds, the dataset would provide a valuable, consent-based, multimodal benchmark focused on high-stakes talking-head scenarios and recent generators, potentially advancing detector robustness.

major comments (2)

[Abstract] Abstract: the central claim that 'without retraining, these detectors fail to generalize' is presented without any reported metrics, list of tested models, quantitative results, or description of the evaluation protocol, preventing verification of the empirical finding.
[Evaluation / Results] No section describes explicit controls, ablations, or matched real-video subsets to isolate the contribution of the 14 new synthesis engines from domain shift in the real class (self-recorded webcam videos with varied lighting/backgrounds versus the controlled corpora on which most detectors were trained, e.g., FaceForensics++). This attribution is load-bearing for the paper's main conclusion.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and constructive feedback. We address the two major comments point by point below, indicating where revisions will be made to improve clarity and rigor.

Referee: [Abstract] Abstract: the central claim that 'without retraining, these detectors fail to generalize' is presented without any reported metrics, list of tested models, quantitative results, or description of the evaluation protocol, preventing verification of the empirical finding.

Authors: The abstract is a concise summary and therefore omits the full list of models, exact metrics, and protocol details. These are provided in the Evaluation section of the manuscript, which reports results across multiple state-of-the-art detectors, the evaluation protocol (cross-dataset testing without retraining), and quantitative outcomes demonstrating poor generalization. To improve verifiability from the abstract itself, we will add a sentence summarizing the key quantitative finding (e.g., average AUC drop) while respecting length limits.

Revision: yes

Referee: [Evaluation / Results] No section describes explicit controls, ablations, or matched real-video subsets to isolate the contribution of the 14 new synthesis engines from domain shift in the real class (self-recorded webcam videos with varied lighting/backgrounds versus the controlled corpora on which most detectors were trained, e.g., FaceForensics++). This attribution is load-bearing for the paper's main conclusion.

Authors: We acknowledge that the current manuscript does not contain explicit ablations or matched real-video subsets that would fully isolate the effect of the 14 synthesis engines from potential domain shift in the real videos. The evaluation instead emphasizes end-to-end generalization failure on modern, identity-matched deepfakes generated from the new real recordings. We agree this is a substantive point and will add a dedicated ablation subsection using matched real subsets drawn from controlled corpora (e.g., FaceForensics++) to better attribute performance drops to the new generators versus real-video domain differences.

Revision: yes
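
The attribution question in this exchange can be sketched as a per-class breakdown at a fixed decision threshold: if the false-positive rate rises when only the source of real clips changes, the recording conditions are implicated; if the true-positive rate falls when only the source of fakes changes, the generators are. The code below is one possible layout for such an ablation, not the authors' protocol. The group names, the scores, and the threshold of 0.5 are all hypothetical.

```python
def rate_above(scores, threshold):
    """Fraction of clips whose fake-score is at or above the threshold."""
    return sum(s >= threshold for s in scores) / len(scores)


# Assumed operating point, fixed before looking at any new data.
THRESHOLD = 0.5

# Hypothetical detector scores (higher = more likely fake), grouped by source.
real_scores = {
    "controlled_real": [0.1, 0.2, 0.3, 0.4],
    "webcam_real": [0.2, 0.4, 0.6, 0.7],
}
fake_scores = {
    "legacy_generator": [0.6, 0.7, 0.8, 0.9],
    "new_generator": [0.3, 0.4, 0.6, 0.8],
}

# Real clips flagged as fake: sensitive to recording-condition shift.
false_positive_rate = {
    name: rate_above(scores, THRESHOLD) for name, scores in real_scores.items()
}
# Fake clips correctly flagged: sensitive to generator novelty.
true_positive_rate = {
    name: rate_above(scores, THRESHOLD) for name, scores in fake_scores.items()
}

print("FPR by real source:", false_positive_rate)
print("TPR by fake source:", true_positive_rate)
```

The separation is only approximate, because fakes built from webcam recordings inherit those recording conditions. A cleaner design would also generate fakes with the new engines from controlled-corpus source video.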

Circularity Check

0 steps flagged

No circularity: empirical dataset release and benchmark

The paper is a dataset contribution plus empirical evaluation of existing detectors on new data. No derivations, equations, fitted parameters renamed as predictions, or first-principles claims appear in the abstract or described content. The central result (detectors fail to generalize) is a direct empirical measurement, not a reduction to self-defined inputs or self-citations. The work is self-contained against external benchmarks with no load-bearing self-referential steps.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The paper introduces no new mathematical parameters, axioms, or invented entities; it relies on existing generative-AI tools and standard data-collection practices.

pith-pipeline@v0.9.0 · 5791 in / 1142 out tokens · 34319 ms · 2026-05-23T21:44:11.068351+00:00 · methodology

Forward citations

Cited by 5 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Scalable, Energy-Efficient Optical-Neural Architecture for Multiplexed Deepfake Video Detection
cs.CV · 2026-05 · conditional · novelty 6.0
Hybrid optical-digital architecture multiplexes 15+ video streams for parallel deepfake detection, reporting 97.79% average accuracy on Celeb-DF with resilience to degradation and attacks.

The Alpha Blending Hypothesis: Compositing Shortcut in Deepfake Detection
cs.CV · 2026-05 · unverdicted · novelty 6.0
Deepfake detectors act as alpha blending searchers; training solely on self-blended real images yields top cross-dataset generalization on 15 datasets without using synthetic deepfakes.

Deepfake Detection that Generalizes Across Benchmarks
cs.CV · 2025-08 · accept · novelty 6.0
GenD achieves state-of-the-art average cross-dataset AUROC in deepfake detection by parameter-efficient adaptation of a foundational vision encoder with hyperspherical manifold enforcement via L2 normalization and met...

Omni-Fake: Benchmarking Unified Multimodal Social Media Deepfake Detection
cs.CV · 2026-05 · unverdicted · novelty 5.0
Omni-Fake delivers a unified multimodal deepfake benchmark dataset and RL-driven detector that reports gains in accuracy, cross-modal generalization, and explainability over prior baselines.

EMO-BOOST: Emotion-Augmented Audio-Visual Features for Improved Generalization in Deepfake Detection
cs.AI · 2026-05 · unverdicted · novelty 4.0
Emo-Boost augments low-level deepfake detectors with intra- and inter-modal emotion consistency checks to raise cross-manipulation generalization AUC by 2.1% on FakeAVCeleb.

Reference graph

Works this paper leans on

82 extracted references · 82 canonical work pages · cited by 5 Pith papers · 3 internal anchors

[1]

Deep audio-visual speech recognition

Triantafyllos Afouras, Joon Son Chung, Andrew Senior, Oriol Vinyals, and Andrew Zisserman. Deep audio-visual speech recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(12):8717–8727, 2018

work page 2018
[2]

XLS-R: Self-supervised cross-lingual speech representation learning at scale

Arun Babu, Changhan Wang, Andros Tjandra, Kushal Lakhotia, Qiantong Xu, Naman Goyal, Kritika Singh, Patrick von Platen, Yatharth Saraf, Juan Pino, Alexei Baevski, Alexis Conneau, and Michael Auli. XLS-R: Self-supervised cross-lingual speech representation learning at scale. In Interspeech, pages 2278–2282, 2022

work page 2022
[3]

Single and multi-speaker cloned voice detection: From perceptual to learned features

Sarah Barrington, Romit Barua, Gautham Koorma, and Hany Farid. Single and multi-speaker cloned voice detection: From perceptual to learned features. InIEEE International Workshop on Information Forensics and Security, pages 1–6. IEEE, 2023

work page 2023
[4]

People are poorly equipped to detect AI-powered voice clones

Sarah Barrington, Emily A Cooper, and Hany Farid. People are poorly equipped to detect AI-powered voice clones. Scientific Reports, 15(1):11004, 2025

work page 2025
[5]

Deepfakes and synthetic media in the financial system: Assessing threat scenarios

Jon Bateman. Deepfakes and synthetic media in the financial system: Assessing threat scenarios. Carnegie Endowment for International Peace., 2022

work page 2022
[6]

FreqNet: A frequency-domain image super-resolution network with dicrete cosine transform

Runyuan Cai, Yue Ding, and Hongtao Lu. FreqNet: A frequency-domain image super-resolution network with dicrete cosine transform. 2021

work page 2021
[7]

Do you really mean that? Content driven audio-visual deepfake dataset and multimodal method for temporal forgery localization

Zhixi Cai, Kalin Stefanov, Abhinav Dhall, and Munawar Hayat. Do you really mean that? Content driven audio-visual deepfake dataset and multimodal method for temporal forgery localization. In International Conference on Digital Image Computing: Techniques and Applications), pages 1–10. IEEE, 2022

work page 2022
[8]

Simswap: An efficient framework for high fidelity face swapping

Renwang Chen, Xuanhong Chen, Bingbing Ni, and Yanhao Ge. Simswap: An efficient framework for high fidelity face swapping. In Proceedings of the 28th ACM international conference on multimedia, pages 2003–2011, 2020

work page 2003
[9]

VideoRetalking: Audio-based lip synchronization for talking head video editing in the wild

Kun Cheng, Xiaodong Cun, Yong Zhang, Menghan Xia, Fei Yin, Mingrui Zhu, Xuan Wang, Jue Wang, and Nannan Wang. VideoRetalking: Audio-based lip synchronization for talking head video editing in the wild. InSIGGRAPH Asia, pages 1–9, 2022

work page 2022
[10]

Deep fakes: A looming challenge for privacy, democracy, and national security

Bobby Chesney and Danielle Citron. Deep fakes: A looming challenge for privacy, democracy, and national security. Calif. L. Rev., 107:1753, 2019

work page 2019
[11]

V oxceleb2: Deep speaker recognition,

Joon Son Chung, Arsha Nagrani, and Andrew Zisserman. V oxCeleb2: Deep speaker recognition. arXiv:1806.05622, 2018

work page arXiv 2018
[12]

The malicious technical ecosystem: Exposing limi- tations in technical governance of ai-generated non-consensual intimate images of adults

Michelle L Ding and Harini Suresh. The malicious technical ecosystem: Exposing limi- tations in technical governance of ai-generated non-consensual intimate images of adults. arXiv:2504.17663, 2025

work page arXiv 2025
[13]

The DeepFake Detection Challenge (DFDC) Dataset

Brian Dolhansky, Joanna Bitton, Ben Pflaum, Jikuo Lu, Russ Howes, Menglin Wang, and Cris- tian Canton Ferrer. The deepfake detection challenge (DFDC) dataset. arXiv:2006.07397, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2006
[14]

Contributing data to deepfake detection research

Nick Dufour and Andrew Gully. Contributing data to deepfake detection research. Google AI Blog, 2019

work page 2019
[15]

Charting the landscape of nefarious uses of generative artificial intelligence for online election interference

Emilio Ferrara. Charting the landscape of nefarious uses of generative artificial intelligence for online election interference. arXiv:2406.01862, 2024

work page arXiv 2024
[16]

J. H. Frank and L. Schönherr. WaveFake: A Data Set to Facilitate Audio DeepFake Detection. In Proceedings of the 35th Conference on Neural Information Processing Systems (NeurIPS), Datasets and Benchmarks Track, pages 1–18, June 2021

work page 2021
[17]

J. S. Garofolo, L. F. Lamel, W. M. Fisher, J. G. Fiscus, D. S. Pallett, and N. L. Dahlgren. DARPA TIMIT acoustic phonetic continuous speech corpus, 1993. 10

work page 1993
[18]

Deepfake detection by human crowds, machines, and machine-informed crowds

Matthew Groh, Ziv Epstein, Chaz Firestone, and Rosalind Picard. Deepfake detection by human crowds, machines, and machine-informed crowds. Proceedings of the National Academy of Sciences, 119(1):e2110013119, 2022

work page 2022
[19]

LivePortrait: Efficient portrait animation with stitching and retargeting control

Jianzhu Guo, Dingyun Zhang, Xiaoqiang Liu, Zhizhou Zhong, Yuan Zhang, Pengfei Wan, and Di Zhang. LivePortrait: Efficient portrait animation with stitching and retargeting control. 2024

work page 2024
[20]

ForgeryNet: A versatile benchmark for comprehensive forgery analysis

Yinan He, Bei Gan, Siyu Chen, Yichun Zhou, Guojun Yin, Luchuan Song, Lu Sheng, Jing Shao, and Ziwei Liu. ForgeryNet: A versatile benchmark for comprehensive forgery analysis. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4360–4369, 2021

work page 2021
[21]

The lj speech dataset

Keith Ito and Linda Johnson. The lj speech dataset. https://keithito.com/LJ-Speech- Dataset/, 2017

work page 2017
[22]

Weiss, Quan Wang, Jonathan Shen, Fei Ren, Zhifeng Chen, Patrick Nguyen, Ruoming Pang, Ignacio Lopez Moreno, and Yonghui Wu

Ye Jia, Yu Zhang, Ron J. Weiss, Quan Wang, Jonathan Shen, Fei Ren, Zhifeng Chen, Patrick Nguyen, Ruoming Pang, Ignacio Lopez Moreno, and Yonghui Wu. Transfer learning from speaker verification to multispeaker text-to-speech synthesis. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, NIPS’18, page 4485–4495, Red...

work page 2018
[23]

DeeperForensics-1.0: A large-scale dataset for real-world face forgery detection

Liming Jiang, Ren Li, Wayne Wu, Chen Qian, and Chen Change Loy. DeeperForensics-1.0: A large-scale dataset for real-world face forgery detection. InIEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2889–2898, 2020

work page 2020
[24]

AASIST: Audio anti-spoofing using integrated spectro-temporal graph attention networks

Jee-weon Jung, Hee-Soo Heo, Hemlata Tak, Hye-jin Shim, Joon Son Chung, Bong-Jin Lee, Ha-Jin Yu, and Nicholas Evans. AASIST: Audio anti-spoofing using integrated spectro-temporal graph attention networks. In IEEE International Conference on Acoustics, Ppeech and Signal Processing, pages 6367–6371, 2022

work page 2022
[25]

FakeA VCeleb: A novel audio-video multimodal deepfake dataset

Hasam Khalid, Shahroz Tariq, Minha Kim, and Simon S Woo. FakeA VCeleb: A novel audio-video multimodal deepfake dataset. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2), 2021

work page 2021
[26]

TitaNet: Neural model for speaker rep- resentation with 1D depth-wise separable convolutions and global context

Nithin Rao Koluguri, Taejin Park, and Boris Ginsburg. TitaNet: Neural model for speaker rep- resentation with 1D depth-wise separable convolutions and global context. InIEEE International Conference on Acoustics, Speech and Signal Processing, pages 8102–8106. IEEE, 2022

work page 2022
[27]

Towards automatic face-to-face translation

Prajwal KR, Rudrabha Mukhopadhyay, Jerin Philip, Abhishek Jha, Vinay Namboodiri, and CV Jawahar. Towards automatic face-to-face translation. In27th ACM International Conference on Multimedia, pages 1428–1436, 2019

work page 2019
[28]

LatentSync: Audio conditioned latent diffusion models for lip sync

Chunyu Li, Chao Zhang, Weikai Xu, Jinghui Xie, Weiguo Feng, Bingyue Peng, and Weiwei Xing. LatentSync: Audio conditioned latent diffusion models for lip sync. 2024

work page 2024
[29]

Celeb-DF: A large-scale challenging dataset for deepfake forensics

Yuezun Li, Xin Yang, Pu Sun, Honggang Qi, and Siwei Lyu. Celeb-DF: A large-scale challenging dataset for deepfake forensics. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3207–3216, 2020

work page 2020
[30]

BlendGAN: Implicitly GAN blending for arbitrary stylized face generation.Advances in Neural Information Processing Systems, 34:29710–29722, 2021

Mingcong Liu, Qiang Li, Zekui Qin, Guoxin Zhang, Pengfei Wan, and Wen Zheng. BlendGAN: Implicitly GAN blending for arbitrary stylized face generation.Advances in Neural Information Processing Systems, 34:29710–29722, 2021

work page 2021
[31]

Lips are lying: Spotting the temporal inconsistency between audio and visual in lip-syncing deepfakes.Advances in Neural Information Processing Systems, 37:91131–91155, 2024

Weifeng Liu, Tianyi She, Jiawei Liu, Boheng Li, Dongyu Yao, and Run Wang. Lips are lying: Spotting the temporal inconsistency between audio and visual in lip-syncing deepfakes.Advances in Neural Information Processing Systems, 37:91131–91155, 2024

work page 2024
[32]

Asvspoof 2021: Towards spoofed and deepfake speech detection in the wild.IEEE/ACM Trans

Xuechen Liu, Xin Wang, Md Sahidullah, Jose Patino, Héctor Delgado, Tomi Kinnunen, Massimiliano Todisco, Junichi Yamagishi, Nicholas Evans, Andreas Nautsch, and Kong Aik Lee. Asvspoof 2021: Towards spoofed and deepfake speech detection in the wild.IEEE/ACM Trans. Audio, Speech and Lang. Proc., 31:2507–2522, June 2023

work page 2021
[33]

The Ryerson audio-visual database of emotional speech and song (RA VDESS): A dynamic, multimodal set of facial and vocal expressions in north american english

Steven R Livingstone and Frank A Russo. The Ryerson audio-visual database of emotional speech and song (RA VDESS): A dynamic, multimodal set of facial and vocal expressions in north american english. PloS one, 13(5):e0196391, 2018. 11

work page 2018
[34]

MediaPipe: A Framework for Building Perception Pipelines

Camillo Lugaresi, Jiuqiang Tang, Hadon Nash, Chris McClanahan, Esha Uboweja, Michael Hays, Fan Zhang, Chuo-Ling Chang, Ming Guang Yong, Juhyun Lee, et al. MediaPipe: A framework for building perception pipelines. arXiv:1906.08172, 2019

work page internal anchor Pith review Pith/arXiv arXiv 1906
[35]

Diff2Lip: Audio conditioned diffusion models for lip-synchronization

Soumik Mukhopadhyay, Saksham Suri, Ravi Teja Gadde, and Abhinav Shrivastava. Diff2Lip: Audio conditioned diffusion models for lip-synchronization. InIEEE/CVF Winter Conference on Applications of Computer Vision, pages 5292–5302, 2024

work page 2024
[36]

V oxceleb: a large-scale speaker identiﬁcation dataset,

Arsha Nagrani, Joon Son Chung, and Andrew Zisserman. V oxCeleb: A large-scale speaker identification dataset. arXiv:1706.08612, 2017

work page arXiv 2017
[37]

Nightingale and Hany Farid

Sophie J. Nightingale and Hany Farid. AI-synthesized faces are indistinguishable from real faces and more trustworthy.Proceedings of the National Academy of Sciences, 119(8):e2120481119, 2022

work page 2022
[38]

Training-free deepfake voice recognition by leveraging large-scale pre-trained models

Alessandro Pianese, Davide Cozzolino, Giovanni Poggi, and Luisa Verdoliva. Training-free deepfake voice recognition by leveraging large-scale pre-trained models. In ACM Workshop on Information Hiding and Multimedia Security, page 289–294, New York, NY , USA, 2024

work page 2024
[39]

A lip sync expert is all you need for speech to lip generation in the wild

KR Prajwal, Rudrabha Mukhopadhyay, Vinay P Namboodiri, and CV Jawahar. A lip sync expert is all you need for speech to lip generation in the wild. In28th ACM International Conference on Multimedia, pages 484–492, 2020

work page 2020
[40]

Learning transferable visual models from natural language supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. InInternational Conference on Machine Learning, pages 8748–8763. PMLR, 2021

work page 2021
[41]

Robust speech recognition via large-scale weak supervision

Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. Robust speech recognition via large-scale weak supervision. In International Conference on Machine Learning, pages 28492–28518. PMLR, 2023

work page 2023
[42]

Facial geometric detail recovery via implicit representation

Xingyu Ren, Alexandros Lattas, Baris Gecer, Jiankang Deng, Chao Ma, and Xiaokang Yang. Facial geometric detail recovery via implicit representation. In IEEE 17th International Conference on Automatic Face and Gesture Recognition, 2023

work page 2023
[43]

FaceForensics: A Large-scale Video Dataset for Forgery Detection in Human Faces

Andreas Rössler, Davide Cozzolino, Luisa Verdoliva, Christian Riess, Justus Thies, and Matthias Nießner. FaceForensics: A large-scale video dataset for forgery detection in human faces. arXiv:1803.09179, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[44]

FaceForensics++: Learning to detect manipulated facial images

Andreas Rössler, Davide Cozzolino, Luisa Verdoliva, Christian Riess, Justus Thies, and Matthias Nießner. FaceForensics++: Learning to detect manipulated facial images. In IEEE/CVF International Conference on Computer Vision, pages 1–11, 2019

[45]

IEEE recommended practice for speech quality measurements

Ernst H Rothauser. IEEE recommended practice for speech quality measurements. IEEE Transactions on Audio and Electroacoustics, 17(3):225–246, 1969

[46]

FaceFusion

Henry Ruhs. FaceFusion. https://github.com/facefusion/facefusion, 2024

[47]

JSUT corpus: Free large-scale Japanese speech corpus for end-to-end speech synthesis

Ryosuke Sonobe, Shinnosuke Takamichi, and Hiroshi Saruwatari. JSUT corpus: Free large-scale Japanese speech corpus for end-to-end speech synthesis, 2017. arXiv preprint

[48]

AI-enabled influence operations: Safeguarding future elections

Sam Stockwell, Megan Hughes, Phil Swatton, Albert Zhang, Jonathan Hall KC, and Kieran. AI-enabled influence operations: Safeguarding future elections. Technical report, Centre for Emerging Technology and Security (CETaS), The Alan Turing Institute, November 2024

[49]

MultiTalk: Enhancing 3D talking head generation across languages with multilingual video dataset

Kim Sung-Bin, Lee Chae-Yeon, Gihun Son, Oh Hyun-Bin, Janghoon Ju, Suekyeong Nam, and Tae-Hyun Oh. MultiTalk: Enhancing 3D talking head generation across languages with multilingual video dataset. arXiv:2406.14272, 2024

[50]

End-to-end spectro-temporal graph attention networks for speaker verification anti-spoofing and speech deepfake detection

Hemlata Tak, Jee-Weon Jung, Jose Patino, Madhu Kamble, Massimiliano Todisco, and Nicholas Evans. End-to-end spectro-temporal graph attention networks for speaker verification anti-spoofing and speech deepfake detection. arXiv:2107.12710, 2021

[51]

End-to-end anti-spoofing with RawNet2

Hemlata Tak, Jose Patino, Massimiliano Todisco, Andreas Nautsch, Nicholas Evans, and Anthony Larcher. End-to-end anti-spoofing with RawNet2. In IEEE International Conference on Acoustics, Speech and Signal Processing, pages 6369–6373, 2021

[52]

Deepfakes and disinformation: Exploring the impact of synthetic political video on deception, uncertainty, and trust in news

Cristian Vaccari and Andrew Chadwick. Deepfakes and disinformation: Exploring the impact of synthetic political video on deception, uncertainty, and trust in news. Social Media + Society, 6(1):2056305120903408, 2020

[53]

Designed to abuse? deepfakes and the non-consensual diffusion of intimate images

Marco Viola and Cristina Voto. Designed to abuse? Deepfakes and the non-consensual diffusion of intimate images. Synthese, 201(1):30, 2023

[54]

INSwapper: Face swapping model based on insightface

Haofan Wang. INSwapper: Face swapping model based on insightface. https://github.com/haofanwang/inswapper, 2023

[55]

MEAD: A large-scale audio-visual dataset for emotional talking-face generation

Kaisiyuan Wang, Qianyi Wu, Linsen Song, Zhuoqian Yang, Wayne Wu, Chen Qian, Ran He, Yu Qiao, and Chen Change Loy. MEAD: A large-scale audio-visual dataset for emotional talking-face generation. In European Conference on Computer Vision, pages 700–717. Springer, 2020

[56]

RestoreFormer++: Towards real-world blind face restoration from undegraded key-value pairs

Zhouxia Wang, Jiawei Zhang, Tianshui Chen, Wenping Wang, and Ping Luo. RestoreFormer++: Towards real-world blind face restoration from undegraded key-value pairs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023

[57]

Speech accent archive

Steven Weinberger. Speech accent archive, 2015. Retrieved from the Speech Accent Archive

[58]

Deepfake video detection using generative convolutional vision transformer

Deressa Wodajo, Solomon Atnafu, and Zahid Akhtar. Deepfake video detection using generative convolutional vision transformer. 2023

[59]

Large-scale contrastive language-audio pretraining with feature fusion and keyword-to-caption augmentation

Yusong Wu, Ke Chen, Tianyu Zhang, Yuchen Hui, Taylor Berg-Kirkpatrick, and Shlomo Dubnov. Large-scale contrastive language-audio pretraining with feature fusion and keyword-to-caption augmentation. In IEEE International Conference on Acoustics, Speech and Signal Processing, pages 1–5. IEEE, 2023

[60]

VFHQ: A high-quality dataset and benchmark for video face super-resolution

Liangbin Xie, Xintao Wang, Honglun Zhang, Chao Dong, and Ying Shan. VFHQ: A high-quality dataset and benchmark for video face super-resolution. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 657–666, 2022

[61]

DF40: Toward next-generation deepfake detection

Zhiyuan Yan, Taiping Yao, Shen Chen, Yandan Zhao, Xinghe Fu, Junwei Zhu, Donghao Luo, Chengjie Wang, Shouhong Ding, Yunsheng Wu, et al. DF40: Toward next-generation deepfake detection. arXiv:2406.13495, 2024

[62]

LRW-1000: A naturally-distributed large-scale benchmark for lip reading in the wild

Shuang Yang, Yuanhang Zhang, Dalu Feng, Mingmin Yang, Chenhao Wang, Jingyun Xiao, Keyu Long, Shiguang Shan, and Xilin Chen. LRW-1000: A naturally-distributed large-scale benchmark for lip reading in the wild. In IEEE International Conference on Automatic Face & Gesture Recognition, pages 1–8. IEEE, 2019

[63]

HelloMeme: Integrating spatial knitting attentions to embed high-level and fidelity-rich conditions in diffusion models

Shengkai Zhang, Nianhong Jiao, Tian Li, Chaojie Yang, Chenhui Xue, Boya Niu, and Jun Gao. HelloMeme: Integrating spatial knitting attentions to embed high-level and fidelity-rich conditions in diffusion models. 2024

[64]

Flow-guided one-shot talking face generation with a high-resolution audio-visual dataset

Zhimeng Zhang, Lincheng Li, Yu Ding, and Changjie Fan. Flow-guided one-shot talking face generation with a high-resolution audio-visual dataset. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3661–3670, 2021

[65]

MEMO: Memory-guided diffusion for expressive talking video generation

Longtao Zheng, Yifan Zhang, Hanzhong Guo, Jiachun Pan, Zhenxiong Tan, Jiahao Lu, Chuanxin Tang, Bo An, and Shuicheng Yan. MEMO: Memory-guided diffusion for expressive talking video generation. 2024

[66]

Towards robust blind face restoration with codebook lookup transformer

Shangchen Zhou, Kelvin Chan, Chongyi Li, and Chen Change Loy. Towards robust blind face restoration with codebook lookup transformer. Advances in Neural Information Processing Systems, 35:30599–30611, 2022

[67]

Towards robust blind face restoration with codebook lookup transformer

Shangchen Zhou, Kelvin C.K. Chan, Chongyi Li, and Chen Change Loy. Towards robust blind face restoration with codebook lookup transformer. In NeurIPS, 2022

[68]

CelebV-HQ: A large-scale video facial attributes dataset

Hao Zhu, Wayne Wu, Wentao Zhu, Liming Jiang, Siwei Tang, Li Zhang, Ziwei Liu, and Chen Change Loy. CelebV-HQ: A large-scale video facial attributes dataset. In European Conference on Computer Vision, pages 650–667. Springer, 2022

[69]

WildDeepfake: A challenging real-world dataset for deepfake detection

Bojia Zi, Minghao Chang, Jingjing Chen, Xingjun Ma, and Yu-Gang Jiang. WildDeepfake: A challenging real-world dataset for deepfake detection. In 28th ACM International Conference on Multimedia, pages 2382–2390, 2020

[70]

LogisticRegression: max iterations = 1000, with all other parameters as per the defaults in the Scikit-learn LogisticRegression model

[71]

database_path

RandomForestClassifier: number of estimators = 100, with all other parameters as per the defaults in the Scikit-learn RandomForestClassifier model. L.3 Audio Raw Waveform Pretrained Model Configurations. Listing 1: AASIST Full Configuration { "database_path": "./LA/", "asv_score_path": "ASVspoof2019_LA_asv...

[72]

Look down and to the right, and slowly count out loud to three

[73]

Look down and to the middle, and slowly count out loud to three

[74]

Look down and to the left, and slowly count out loud to three

[75]

Look up and to the left, and slowly count out loud to three

[76]

Look up and to the middle, and slowly count out loud to three

[77]

Wave your hand back and forth across your face four times while counting out loud

Look up and to the right, and slowly count out loud to three. Wave your hand back and forth across your face four times while counting out loud. Read each question aloud, followed immediately by your answer

[78]

What is my favorite food?

[79]

What is my favorite movie?

[80]

What did I have for breakfast?


