
arXiv: 2408.05366 · v5 · submitted 2024-08-09 · 💻 cs.CV

The DeepSpeak Dataset

Pith reviewed 2026-05-23 21:44 UTC · model grok-4.3

classification 💻 cs.CV
keywords: deepfake detection, deepfake dataset, talking heads, video synthesis, voice cloning, identity matching, multimodal content, adversarial attacks

The pith

Current deepfake detectors fail to generalize to a new dataset of high-quality talking-head fakes without retraining.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces the DeepSpeak dataset of over 100 hours of real and deepfake audiovisual content focused on talking heads. Real recordings come from 500 consenting participants while fakes are produced with 14 video synthesis engines and three voice cloning engines through an embedding-based identity-matching process. Large-scale tests on existing detectors show they do not maintain performance on this material, indicating that training data must incorporate the latest generative tools to address realistic adversarial scenarios such as impostor attacks in video calls.

Core claim

Without retraining, state-of-the-art deepfake detectors fail to generalize to the DeepSpeak dataset, which contains more than 50 hours of self-recorded real data from 500 participants and more than 50 hours of state-of-the-art deepfakes created with 14 video engines, three voice engines, and identity-matched face swaps.

What carries the argument

The DeepSpeak dataset, constructed via embedding-based identity matching to produce realistic audiovisual deepfakes from multiple current synthesis engines.
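
To make the idea of embedding-based identity matching concrete, here is a minimal sketch, assuming each participant is represented by a single face-embedding vector and that matching means picking the most similar other participant by cosine similarity. The participant IDs, the toy three-dimensional vectors, and the function names are hypothetical; the paper's actual embedding model and matching rule are not specified on this page.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def match_identities(embeddings):
    """Map each identity to the most similar *other* identity.

    Assumption: one embedding per identity; a real pipeline would
    probably aggregate embeddings over many frames.
    """
    matches = {}
    for name, vec in embeddings.items():
        candidates = [
            (cosine_similarity(vec, other_vec), other)
            for other, other_vec in embeddings.items()
            if other != name
        ]
        # Keep the candidate with the highest similarity score.
        matches[name] = max(candidates, key=lambda c: c[0])[1]
    return matches


# Hypothetical toy embeddings (real face embeddings have hundreds of dimensions).
embeddings = {
    "p001": [0.9, 0.1, 0.0],
    "p002": [0.8, 0.2, 0.1],
    "p003": [0.0, 0.1, 0.9],
    "p004": [0.1, 0.0, 0.8],
}

matches = match_identities(embeddings)
print(matches)
```

In this toy setup the two visually similar pairs end up matched to each other, which is the property a face-swap pipeline would want from its source and target identities, as opposed to random pairing.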

If this is right

  • Detectors require retraining on data from the newest synthesis engines to regain effectiveness against talking-head deepfakes.
  • Identity-matching protocols produce higher-quality simulated attacks than random pairing methods.
  • Multimodal audiovisual coverage is required to evaluate detectors intended for video-conference and identity-verification settings.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Video-conference platforms could incorporate periodic retraining on datasets like this one to reduce impostor risks.
  • Engine-specific failure patterns in the dataset could guide targeted improvements to detection methods.

Load-bearing premise

Deepfakes generated with the 14 video and three voice engines plus identity matching serve as realistic stand-ins for actual adversarial attacks.

What would settle it

Measure the accuracy of the tested detectors on the full DeepSpeak test set and compare it to their reported accuracy on their original training distributions.
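
One way such a comparison could be run is sketched below, assuming the detector outputs a per-video "fake" score and that AUC is used in place of raw accuracy so the result does not depend on a decision threshold. The label convention (1 = fake, 0 = real) and all scores are made-up illustrations, not results from the paper.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a random fake outscores a random real.

    Ties count as half a win (Mann-Whitney formulation).
    Assumption: label 1 means fake, label 0 means real.
    """
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("need at least one real and one fake example")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Hypothetical detector scores on its original test distribution.
in_dist_labels = [1, 1, 1, 0, 0, 0]
in_dist_scores = [0.9, 0.8, 0.7, 0.3, 0.2, 0.1]

# Hypothetical scores from the same detector, without retraining, on new data.
new_labels = [1, 1, 1, 0, 0, 0]
new_scores = [0.6, 0.4, 0.2, 0.5, 0.3, 0.1]

auc_in_dist = roc_auc(in_dist_labels, in_dist_scores)
auc_new = roc_auc(new_labels, new_scores)
auc_drop = auc_in_dist - auc_new
print(f"in-distribution AUC: {auc_in_dist:.3f}")
print(f"new-dataset AUC:     {auc_new:.3f}")
print(f"drop:                {auc_drop:.3f}")
```

A large positive drop on the new data would be consistent with the generalization failure the paper claims; a drop near zero would cut against it.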

Figures

Figures reproduced from arXiv: 2408.05366 by Hany Farid, Maty Bohacek, Sarah Barrington.

Figure 1: An overview of the DeepSpeak Dataset sourced from a diverse selection of consenting participants using a custom-built data collection methodology. The dataset also comprises deepfakes generated from 14 video and three audio deepfake methods using facial identity matching to improve the realism of the generated deepfakes.
Figure 2: A t-SNE visualization of CLIP embeddings from real participant’s videos. The four
Figure 3: Overview of DeepSpeak deepfakes: Face-swap deepfakes replace the facial region from a source identity with the face of a target identity while retaining the audio from the target video. Lip-sync deepfakes overlay a generated mouth region onto the original face and synchronize it with a new audio track (real or fake). Avatar deepfakes animate the head and shoulders of an identity based on a single still fra…
Figure 4: A screenshot of the custom-built recording tool used for real participant data collection.
Figure 5: Video frames sampled from three representative examples of
Figure 6: Video frames sampled from three representative examples of
Figure 7: Video frames sampled from three representative examples of
Figure 8: Video frames sampled from three representative examples of
Figure 9: Video frames sampled from three representative examples of
Figure 10: Video frames sampled from three representative examples of
Figure 11: Video frames sampled from three representative examples of
Figure 12: Video frames sampled from three representative examples of
Figure 13: Video frames sampled from three representative examples of
Figure 14: Video frames sampled from three representative examples of
Figure 15: Video frames sampled from three representative examples of
Figure 16: Video frames sampled from three representative examples of
Figure 17: Video frames sampled from three representative examples of
Figure 18: Video frames sampled from three representative examples of
Figure 19: Additional examples of matched identities from the first release of DeepSpeak data
Figure 20: Additional examples of matched identities from the second release of DeepSpeak data
Figure 21: Screenshot of the introduction and consent page of the data collection study.
Figure 22: Screenshot of the overview and recording instructions page of the data collection study.
Figure 23: Screenshot of the environment checks page of the data collection study.
Figure 24: Screenshot of the recording test page of the data collection study.
Figure 25: Screenshot of an example prompt from the data collection study.
Figure 26: Representative examples of different action prompt types.
Original abstract

Deepfakes represent a growing concern across domains such as disinformation, fraud, and non-consensual media. In particular, the rise of video conference and identity-driven attacks in high-stakes scenarios--such as impostor hiring--demands new forensic resources. Despite significant efforts to develop robust detection classifiers to distinguish the real from the fake, commonly used training datasets remain inadequate: relying on low-quality and outdated deepfake generators, consisting of content scraped from online repositories without participant consent, lacking in multimodal coverage, and rarely employing identity-matching protocols to ensure realistic fakes. To overcome these limitations, we present the DeepSpeak dataset, a diverse and multimodal dataset comprising over 100 hours of authentic and deepfake audiovisual content, specifically focused on the challenging and diverse talking heads context. We contribute: i) more than 50 hours of real, self-recorded data collected from 500 diverse and consenting participants, ii) more than 50 hours of state-of-the-art audio and visual deepfakes generated using 14 video synthesis engines and three voice cloning engines, and iii) an embedding-based, identity-matching approach to ensure the creation of convincing, high-quality identity face swaps that realistically simulate adversarial deepfake attacks. We also perform large-scale evaluations of state-of-the-art deepfake detectors and show that, without retraining, these detectors fail to generalize to this DeepSpeak dataset, highlighting the importance of a large and diverse dataset containing deepfakes from the latest generative-AI tools.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The manuscript introduces the DeepSpeak dataset: over 100 hours of real and deepfake audiovisual talking-head content from 500 consenting participants, generated with 14 video synthesis engines, 3 voice cloning engines, and embedding-based identity matching. It reports large-scale evaluations claiming that state-of-the-art deepfake detectors fail to generalize to this dataset without retraining, underscoring the need for more diverse and modern training resources.

Significance. If the generalization experiments hold after addressing potential confounds, the dataset would provide a valuable, consent-based, multimodal benchmark focused on high-stakes talking-head scenarios and recent generators, potentially advancing detector robustness.

major comments (2)
  1. [Abstract] The central claim that 'without retraining, these detectors fail to generalize' is presented without any reported metrics, list of tested models, quantitative results, or description of the evaluation protocol, preventing verification of the empirical finding.
  2. [Evaluation / Results] No section describes explicit controls, ablations, or matched real-video subsets to isolate the contribution of the 14 new synthesis engines from domain shift in the real class (self-recorded webcam videos with varied lighting/backgrounds versus the controlled corpora on which most detectors were trained, e.g., FaceForensics++). This attribution is load-bearing for the paper's main conclusion.
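
The control requested in the second comment can be sketched as a per-subset error breakdown, assuming the evaluation videos can be tagged by where the real footage came from and by the age of the generator that made the fakes. The subset names, threshold, and scores below are hypothetical placeholders and do not come from the paper.

```python
from collections import defaultdict


def error_rate_by_subset(records, threshold=0.5):
    """Fraction of misclassified videos within each named subset.

    records: iterable of (subset_name, label, score) tuples, where
    label 1 means fake and 0 means real (an assumed convention).
    """
    totals = defaultdict(int)
    errors = defaultdict(int)
    for subset, label, score in records:
        predicted = 1 if score >= threshold else 0
        totals[subset] += 1
        errors[subset] += int(predicted != label)
    return {subset: errors[subset] / totals[subset] for subset in totals}


# Hypothetical detector scores grouped into four subsets.
records = [
    ("controlled_real", 0, 0.1), ("controlled_real", 0, 0.2),
    ("webcam_real", 0, 0.6), ("webcam_real", 0, 0.3),
    ("older_fake", 1, 0.9), ("older_fake", 1, 0.8),
    ("newer_fake", 1, 0.4), ("newer_fake", 1, 0.7),
]

rates = error_rate_by_subset(records)
for subset, rate in rates.items():
    print(f"{subset}: {rate:.2f}")
```

If errors rise on the webcam-style real subset as well as on the newer fakes, part of the performance drop would be attributable to domain shift in the real class and not only to the new synthesis engines.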

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and constructive feedback. We address the two major comments point by point below, indicating where revisions will be made to improve clarity and rigor.

  1. Referee: [Abstract] The central claim that 'without retraining, these detectors fail to generalize' is presented without any reported metrics, list of tested models, quantitative results, or description of the evaluation protocol, preventing verification of the empirical finding.

    Authors: The abstract is a concise summary and therefore omits the full list of models, exact metrics, and protocol details. These are provided in the Evaluation section of the manuscript, which reports results across multiple state-of-the-art detectors, the evaluation protocol (cross-dataset testing without retraining), and quantitative outcomes demonstrating poor generalization. To improve verifiability from the abstract itself, we will add a sentence summarizing the key quantitative finding (e.g., average AUC drop) while respecting length limits. revision: yes

  2. Referee: [Evaluation / Results] No section describes explicit controls, ablations, or matched real-video subsets to isolate the contribution of the 14 new synthesis engines from domain shift in the real class (self-recorded webcam videos with varied lighting/backgrounds versus the controlled corpora on which most detectors were trained, e.g., FaceForensics++). This attribution is load-bearing for the paper's main conclusion.

    Authors: We acknowledge that the current manuscript does not contain explicit ablations or matched real-video subsets that would fully isolate the effect of the 14 synthesis engines from potential domain shift in the real videos. The evaluation instead emphasizes end-to-end generalization failure on modern, identity-matched deepfakes generated from the new real recordings. We agree this is a substantive point and will add a dedicated ablation subsection using matched real subsets drawn from controlled corpora (e.g., FaceForensics++) to better attribute performance drops to the new generators versus real-video domain differences. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical dataset release and benchmark


The paper is a dataset contribution plus empirical evaluation of existing detectors on new data. No derivations, equations, fitted parameters renamed as predictions, or first-principles claims appear in the abstract or described content. The central result (detectors fail to generalize) is a direct empirical measurement, not a reduction to self-defined inputs or self-citations. The work is self-contained against external benchmarks with no load-bearing self-referential steps.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The paper introduces no new mathematical parameters, axioms, or invented entities; it relies on existing generative-AI tools and standard data-collection practices.

pith-pipeline@v0.9.0 · 5791 in / 1142 out tokens · 34319 ms · 2026-05-23T21:44:11.068351+00:00 · methodology


Forward citations

Cited by 5 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Scalable, Energy-Efficient Optical-Neural Architecture for Multiplexed Deepfake Video Detection

    cs.CV 2026-05 conditional novelty 6.0

    Hybrid optical-digital architecture multiplexes 15+ video streams for parallel deepfake detection, reporting 97.79% average accuracy on Celeb-DF with resilience to degradation and attacks.

  2. The Alpha Blending Hypothesis: Compositing Shortcut in Deepfake Detection

    cs.CV 2026-05 unverdicted novelty 6.0

    Deepfake detectors act as alpha blending searchers; training solely on self-blended real images yields top cross-dataset generalization on 15 datasets without using synthetic deepfakes.

  3. Deepfake Detection that Generalizes Across Benchmarks

    cs.CV 2025-08 accept novelty 6.0

    GenD achieves state-of-the-art average cross-dataset AUROC in deepfake detection by parameter-efficient adaptation of a foundational vision encoder with hyperspherical manifold enforcement via L2 normalization and met...

  4. Omni-Fake: Benchmarking Unified Multimodal Social Media Deepfake Detection

    cs.CV 2026-05 unverdicted novelty 5.0

    Omni-Fake delivers a unified multimodal deepfake benchmark dataset and RL-driven detector that reports gains in accuracy, cross-modal generalization, and explainability over prior baselines.

  5. EMO-BOOST: Emotion-Augmented Audio-Visual Features for Improved Generalization in Deepfake Detection

    cs.AI 2026-05 unverdicted novelty 4.0

    Emo-Boost augments low-level deepfake detectors with intra- and inter-modal emotion consistency checks to raise cross-manipulation generalization AUC by 2.1% on FakeAVCeleb.

Reference graph

Works this paper leans on

82 extracted references · 82 canonical work pages · cited by 5 Pith papers · 3 internal anchors

  1. [1]

    Deep audio-visual speech recognition

    Triantafyllos Afouras, Joon Son Chung, Andrew Senior, Oriol Vinyals, and Andrew Zisserman. Deep audio-visual speech recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(12):8717–8727, 2018

  2. [2]

    XLS-R: Self-supervised cross-lingual speech representation learning at scale

    Arun Babu, Changhan Wang, Andros Tjandra, Kushal Lakhotia, Qiantong Xu, Naman Goyal, Kritika Singh, Patrick von Platen, Yatharth Saraf, Juan Pino, Alexei Baevski, Alexis Conneau, and Michael Auli. XLS-R: Self-supervised cross-lingual speech representation learning at scale. In Interspeech, pages 2278–2282, 2022

  3. [3]

    Single and multi-speaker cloned voice detection: From perceptual to learned features

    Sarah Barrington, Romit Barua, Gautham Koorma, and Hany Farid. Single and multi-speaker cloned voice detection: From perceptual to learned features. InIEEE International Workshop on Information Forensics and Security, pages 1–6. IEEE, 2023

  4. [4]

    People are poorly equipped to detect AI-powered voice clones

    Sarah Barrington, Emily A Cooper, and Hany Farid. People are poorly equipped to detect AI-powered voice clones. Scientific Reports, 15(1):11004, 2025

  5. [5]

    Deepfakes and synthetic media in the financial system: Assessing threat scenarios

    Jon Bateman. Deepfakes and synthetic media in the financial system: Assessing threat scenarios. Carnegie Endowment for International Peace., 2022

  6. [6]

    FreqNet: A frequency-domain image super-resolution network with dicrete cosine transform

    Runyuan Cai, Yue Ding, and Hongtao Lu. FreqNet: A frequency-domain image super-resolution network with dicrete cosine transform. 2021

  7. [7]

    Do you really mean that? Content driven audio-visual deepfake dataset and multimodal method for temporal forgery localization

    Zhixi Cai, Kalin Stefanov, Abhinav Dhall, and Munawar Hayat. Do you really mean that? Content driven audio-visual deepfake dataset and multimodal method for temporal forgery localization. In International Conference on Digital Image Computing: Techniques and Applications), pages 1–10. IEEE, 2022

  8. [8]

    Simswap: An efficient framework for high fidelity face swapping

    Renwang Chen, Xuanhong Chen, Bingbing Ni, and Yanhao Ge. Simswap: An efficient framework for high fidelity face swapping. In Proceedings of the 28th ACM international conference on multimedia, pages 2003–2011, 2020

  9. [9]

    VideoRetalking: Audio-based lip synchronization for talking head video editing in the wild

    Kun Cheng, Xiaodong Cun, Yong Zhang, Menghan Xia, Fei Yin, Mingrui Zhu, Xuan Wang, Jue Wang, and Nannan Wang. VideoRetalking: Audio-based lip synchronization for talking head video editing in the wild. InSIGGRAPH Asia, pages 1–9, 2022

  10. [10]

    Deep fakes: A looming challenge for privacy, democracy, and national security

    Bobby Chesney and Danielle Citron. Deep fakes: A looming challenge for privacy, democracy, and national security. Calif. L. Rev., 107:1753, 2019

  11. [11]

    V oxceleb2: Deep speaker recognition,

    Joon Son Chung, Arsha Nagrani, and Andrew Zisserman. V oxCeleb2: Deep speaker recognition. arXiv:1806.05622, 2018

  12. [12]

    The malicious technical ecosystem: Exposing limi- tations in technical governance of ai-generated non-consensual intimate images of adults

    Michelle L Ding and Harini Suresh. The malicious technical ecosystem: Exposing limi- tations in technical governance of ai-generated non-consensual intimate images of adults. arXiv:2504.17663, 2025

  13. [13]

    The DeepFake Detection Challenge (DFDC) Dataset

    Brian Dolhansky, Joanna Bitton, Ben Pflaum, Jikuo Lu, Russ Howes, Menglin Wang, and Cris- tian Canton Ferrer. The deepfake detection challenge (DFDC) dataset. arXiv:2006.07397, 2020

  14. [14]

    Contributing data to deepfake detection research

    Nick Dufour and Andrew Gully. Contributing data to deepfake detection research. Google AI Blog, 2019

  15. [15]

    Charting the landscape of nefarious uses of generative artificial intelligence for online election interference

    Emilio Ferrara. Charting the landscape of nefarious uses of generative artificial intelligence for online election interference. arXiv:2406.01862, 2024

  16. [16]

    J. H. Frank and L. Schönherr. WaveFake: A Data Set to Facilitate Audio DeepFake Detection. In Proceedings of the 35th Conference on Neural Information Processing Systems (NeurIPS), Datasets and Benchmarks Track, pages 1–18, June 2021

  17. [17]

    J. S. Garofolo, L. F. Lamel, W. M. Fisher, J. G. Fiscus, D. S. Pallett, and N. L. Dahlgren. DARPA TIMIT acoustic phonetic continuous speech corpus, 1993. 10

  18. [18]

    Deepfake detection by human crowds, machines, and machine-informed crowds

    Matthew Groh, Ziv Epstein, Chaz Firestone, and Rosalind Picard. Deepfake detection by human crowds, machines, and machine-informed crowds. Proceedings of the National Academy of Sciences, 119(1):e2110013119, 2022

  19. [19]

    LivePortrait: Efficient portrait animation with stitching and retargeting control

    Jianzhu Guo, Dingyun Zhang, Xiaoqiang Liu, Zhizhou Zhong, Yuan Zhang, Pengfei Wan, and Di Zhang. LivePortrait: Efficient portrait animation with stitching and retargeting control. 2024

  20. [20]

    ForgeryNet: A versatile benchmark for comprehensive forgery analysis

    Yinan He, Bei Gan, Siyu Chen, Yichun Zhou, Guojun Yin, Luchuan Song, Lu Sheng, Jing Shao, and Ziwei Liu. ForgeryNet: A versatile benchmark for comprehensive forgery analysis. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4360–4369, 2021

  21. [21]

    The lj speech dataset

    Keith Ito and Linda Johnson. The lj speech dataset. https://keithito.com/LJ-Speech- Dataset/, 2017

  22. [22]

    Weiss, Quan Wang, Jonathan Shen, Fei Ren, Zhifeng Chen, Patrick Nguyen, Ruoming Pang, Ignacio Lopez Moreno, and Yonghui Wu

    Ye Jia, Yu Zhang, Ron J. Weiss, Quan Wang, Jonathan Shen, Fei Ren, Zhifeng Chen, Patrick Nguyen, Ruoming Pang, Ignacio Lopez Moreno, and Yonghui Wu. Transfer learning from speaker verification to multispeaker text-to-speech synthesis. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, NIPS’18, page 4485–4495, Red...

  23. [23]

    DeeperForensics-1.0: A large-scale dataset for real-world face forgery detection

    Liming Jiang, Ren Li, Wayne Wu, Chen Qian, and Chen Change Loy. DeeperForensics-1.0: A large-scale dataset for real-world face forgery detection. InIEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2889–2898, 2020

  24. [24]

    AASIST: Audio anti-spoofing using integrated spectro-temporal graph attention networks

    Jee-weon Jung, Hee-Soo Heo, Hemlata Tak, Hye-jin Shim, Joon Son Chung, Bong-Jin Lee, Ha-Jin Yu, and Nicholas Evans. AASIST: Audio anti-spoofing using integrated spectro-temporal graph attention networks. In IEEE International Conference on Acoustics, Ppeech and Signal Processing, pages 6367–6371, 2022

  25. [25]

    FakeA VCeleb: A novel audio-video multimodal deepfake dataset

    Hasam Khalid, Shahroz Tariq, Minha Kim, and Simon S Woo. FakeA VCeleb: A novel audio-video multimodal deepfake dataset. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2), 2021

  26. [26]

    TitaNet: Neural model for speaker rep- resentation with 1D depth-wise separable convolutions and global context

    Nithin Rao Koluguri, Taejin Park, and Boris Ginsburg. TitaNet: Neural model for speaker rep- resentation with 1D depth-wise separable convolutions and global context. InIEEE International Conference on Acoustics, Speech and Signal Processing, pages 8102–8106. IEEE, 2022

  27. [27]

    Towards automatic face-to-face translation

    Prajwal KR, Rudrabha Mukhopadhyay, Jerin Philip, Abhishek Jha, Vinay Namboodiri, and CV Jawahar. Towards automatic face-to-face translation. In27th ACM International Conference on Multimedia, pages 1428–1436, 2019

  28. [28]

    LatentSync: Audio conditioned latent diffusion models for lip sync

    Chunyu Li, Chao Zhang, Weikai Xu, Jinghui Xie, Weiguo Feng, Bingyue Peng, and Weiwei Xing. LatentSync: Audio conditioned latent diffusion models for lip sync. 2024

  29. [29]

    Celeb-DF: A large-scale challenging dataset for deepfake forensics

    Yuezun Li, Xin Yang, Pu Sun, Honggang Qi, and Siwei Lyu. Celeb-DF: A large-scale challenging dataset for deepfake forensics. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3207–3216, 2020

  30. [30]

    BlendGAN: Implicitly GAN blending for arbitrary stylized face generation.Advances in Neural Information Processing Systems, 34:29710–29722, 2021

    Mingcong Liu, Qiang Li, Zekui Qin, Guoxin Zhang, Pengfei Wan, and Wen Zheng. BlendGAN: Implicitly GAN blending for arbitrary stylized face generation.Advances in Neural Information Processing Systems, 34:29710–29722, 2021

  31. [31]

    Lips are lying: Spotting the temporal inconsistency between audio and visual in lip-syncing deepfakes.Advances in Neural Information Processing Systems, 37:91131–91155, 2024

    Weifeng Liu, Tianyi She, Jiawei Liu, Boheng Li, Dongyu Yao, and Run Wang. Lips are lying: Spotting the temporal inconsistency between audio and visual in lip-syncing deepfakes.Advances in Neural Information Processing Systems, 37:91131–91155, 2024

  32. [32]

    Asvspoof 2021: Towards spoofed and deepfake speech detection in the wild.IEEE/ACM Trans

    Xuechen Liu, Xin Wang, Md Sahidullah, Jose Patino, Héctor Delgado, Tomi Kinnunen, Massimiliano Todisco, Junichi Yamagishi, Nicholas Evans, Andreas Nautsch, and Kong Aik Lee. Asvspoof 2021: Towards spoofed and deepfake speech detection in the wild.IEEE/ACM Trans. Audio, Speech and Lang. Proc., 31:2507–2522, June 2023

  33. [33]

    The Ryerson audio-visual database of emotional speech and song (RA VDESS): A dynamic, multimodal set of facial and vocal expressions in north american english

    Steven R Livingstone and Frank A Russo. The Ryerson audio-visual database of emotional speech and song (RA VDESS): A dynamic, multimodal set of facial and vocal expressions in north american english. PloS one, 13(5):e0196391, 2018. 11

  34. [34]

    MediaPipe: A Framework for Building Perception Pipelines

    Camillo Lugaresi, Jiuqiang Tang, Hadon Nash, Chris McClanahan, Esha Uboweja, Michael Hays, Fan Zhang, Chuo-Ling Chang, Ming Guang Yong, Juhyun Lee, et al. MediaPipe: A framework for building perception pipelines. arXiv:1906.08172, 2019

  35. [35]

    Diff2Lip: Audio conditioned diffusion models for lip-synchronization

    Soumik Mukhopadhyay, Saksham Suri, Ravi Teja Gadde, and Abhinav Shrivastava. Diff2Lip: Audio conditioned diffusion models for lip-synchronization. InIEEE/CVF Winter Conference on Applications of Computer Vision, pages 5292–5302, 2024

  36. [36]

    V oxceleb: a large-scale speaker identification dataset,

    Arsha Nagrani, Joon Son Chung, and Andrew Zisserman. V oxCeleb: A large-scale speaker identification dataset. arXiv:1706.08612, 2017

  37. [37]

    Nightingale and Hany Farid

    Sophie J. Nightingale and Hany Farid. AI-synthesized faces are indistinguishable from real faces and more trustworthy.Proceedings of the National Academy of Sciences, 119(8):e2120481119, 2022

  38. [38]

    Training-free deepfake voice recognition by leveraging large-scale pre-trained models

    Alessandro Pianese, Davide Cozzolino, Giovanni Poggi, and Luisa Verdoliva. Training-free deepfake voice recognition by leveraging large-scale pre-trained models. In ACM Workshop on Information Hiding and Multimedia Security, page 289–294, New York, NY , USA, 2024

  39. [39]

    A lip sync expert is all you need for speech to lip generation in the wild

    KR Prajwal, Rudrabha Mukhopadhyay, Vinay P Namboodiri, and CV Jawahar. A lip sync expert is all you need for speech to lip generation in the wild. In28th ACM International Conference on Multimedia, pages 484–492, 2020

  40. [40]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. InInternational Conference on Machine Learning, pages 8748–8763. PMLR, 2021

  41. [41]

    Robust speech recognition via large-scale weak supervision

    Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. Robust speech recognition via large-scale weak supervision. In International Conference on Machine Learning, pages 28492–28518. PMLR, 2023

  42. [42]

    Facial geometric detail recovery via implicit representation

    Xingyu Ren, Alexandros Lattas, Baris Gecer, Jiankang Deng, Chao Ma, and Xiaokang Yang. Facial geometric detail recovery via implicit representation. In IEEE 17th International Conference on Automatic Face and Gesture Recognition, 2023

  43. [43]

    FaceForensics: A Large-scale Video Dataset for Forgery Detection in Human Faces

    Andreas Rössler, Davide Cozzolino, Luisa Verdoliva, Christian Riess, Justus Thies, and Matthias Nießner. FaceForensics: A large-scale video dataset for forgery detection in human faces. arXiv:1803.09179, 2018

  44. [44]

    Faceforensics++: Learning to detect manipulated facial images

    Andreas Rossler, Davide Cozzolino, Luisa Verdoliva, Christian Riess, Justus Thies, and Matthias Nießner. Faceforensics++: Learning to detect manipulated facial images. In IEEE/CVF International Conference on Computer Vision, pages 1–11, 2019

  45. [45]

    Ieee recommended practice for speech quality measurements

    Ernst H Rothauser. Ieee recommended practice for speech quality measurements. IEEE Transactions on Audio and Electroacoustics, 17(3):225–246, 1969

  46. [46]

    FaceFusion

    Henry Ruhs. FaceFusion. https://github.com/facefusion/facefusion, 2024

  47. [47]

    JSUT corpus: Free large-scale Japanese speech corpus for end-to-end speech synthesis, 2017

    Ryosuke Sonobe, Shinnosuke Takamichi, and Hiroshi Saruwatari. JSUT corpus: Free large-scale Japanese speech corpus for end-to-end speech synthesis, 2017. arXiv preprint

  48. [48]

    AI-enabled influence operations: Safeguarding future elections

    Sam Stockwell, Megan Hughes, Phil Swatton, Albert Zhang, Jonathan Hall KC, and Kieran. AI-enabled influence operations: Safeguarding future elections. Technical report, Centre for Emerging Technology and Security (CETaS), The Alan Turing Institute, November 2024

  49. [49]

    MultiTalk: Enhancing 3D talking head generation across languages with multilingual video dataset

    Kim Sung-Bin, Lee Chae-Yeon, Gihun Son, Oh Hyun-Bin, Janghoon Ju, Suekyeong Nam, and Tae-Hyun Oh. MultiTalk: Enhancing 3D talking head generation across languages with multilingual video dataset. arXiv:2406.14272, 2024

  50. [50]

    End-to-end spectro-temporal graph attention networks for speaker ver- ification anti-spoofing and speech deepfake detection,

    Hemlata Tak, Jee-Weon Jung, Jose Patino, Madhu Kamble, Massimiliano Todisco, and Nicholas Evans. End-to-end spectro-temporal graph attention networks for speaker verification anti-spoofing and speech deepfake detection. arXiv:2107.12710, 2021. 12

  51. [51]

    End-to-end anti-spoofing with RawNet2

    Hemlata Tak, Jose Patino, Massimiliano Todisco, Andreas Nautsch, Nicholas Evans, and Anthony Larcher. End-to-end anti-spoofing with RawNet2. InIEEE International Conference on Acoustics, Speech and Signal Processing, pages 6369–6373, 2021

  52. [52]

    Deepfakes and disinformation: Exploring the impact of synthetic political video on deception, uncertainty, and trust in news

    Cristian Vaccari and Andrew Chadwick. Deepfakes and disinformation: Exploring the impact of synthetic political video on deception, uncertainty, and trust in news. Social Media + Society, 6(1):2056305120903408, 2020

  53. [53]

    Designed to abuse? deepfakes and the non-consensual diffusion of intimate images

    Marco Viola and Cristina Voto. Designed to abuse? deepfakes and the non-consensual diffusion of intimate images. Synthese, 201(1):30, 2023

  54. [54]

    INSwapper: Face swapping model based on insightface

    Haofan Wang. INSwapper: Face swapping model based on insightface. https://github.com/haofanwang/inswapper, 2023

  55. [55]

    MEAD: A large-scale audio-visual dataset for emotional talking-face generation

    Kaisiyuan Wang, Qianyi Wu, Linsen Song, Zhuoqian Yang, Wayne Wu, Chen Qian, Ran He, Yu Qiao, and Chen Change Loy. MEAD: A large-scale audio-visual dataset for emotional talking-face generation. In European Conference on Computer Vision, pages 700–717. Springer, 2020

  56. [56]

    RestoreFormer++: Towards real-world blind face restoration from undegraded key-value pairs

    Zhouxia Wang, Jiawei Zhang, Tianshui Chen, Wenping Wang, and Ping Luo. RestoreFormer++: Towards real-world blind face restoration from undegraded key-value pairs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023

  57. [57]

    Speech accent archive

    Steven Weinberger. Speech accent archive, 2015. Retrieved from the Speech Accent Archive

  58. [58]

    Deepfake video detection using generative convolutional vision transformer

    Deressa Wodajo, Solomon Atnafu, and Zahid Akhtar. Deepfake video detection using generative convolutional vision transformer. 2023

  59. [59]

    Large-scale contrastive language-audio pretraining with feature fusion and keyword-to-caption augmentation

    Yusong Wu, Ke Chen, Tianyu Zhang, Yuchen Hui, Taylor Berg-Kirkpatrick, and Shlomo Dubnov. Large-scale contrastive language-audio pretraining with feature fusion and keyword-to-caption augmentation. In IEEE International Conference on Acoustics, Speech and Signal Processing, pages 1–5. IEEE, 2023

  60. [60]

    VFHQ: A high-quality dataset and benchmark for video face super-resolution

    Liangbin Xie, Xintao Wang, Honglun Zhang, Chao Dong, and Ying Shan. VFHQ: A high-quality dataset and benchmark for video face super-resolution. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 657–666, 2022

  61. [61]

    DF40: Toward next-generation deepfake detection

    Zhiyuan Yan, Taiping Yao, Shen Chen, Yandan Zhao, Xinghe Fu, Junwei Zhu, Donghao Luo, Chengjie Wang, Shouhong Ding, Yunsheng Wu, et al. DF40: Toward next-generation deepfake detection. arXiv:2406.13495, 2024

  62. [62]

    LRW-1000: A naturally-distributed large-scale benchmark for lip reading in the wild

    Shuang Yang, Yuanhang Zhang, Dalu Feng, Mingmin Yang, Chenhao Wang, Jingyun Xiao, Keyu Long, Shiguang Shan, and Xilin Chen. LRW-1000: A naturally-distributed large-scale benchmark for lip reading in the wild. In IEEE International Conference on Automatic Face & Gesture Recognition, pages 1–8. IEEE, 2019

  63. [63]

    HelloMeme: Integrating spatial knitting attentions to embed high-level and fidelity-rich conditions in diffusion models

    Shengkai Zhang, Nianhong Jiao, Tian Li, Chaojie Yang, Chenhui Xue, Boya Niu, and Jun Gao. HelloMeme: Integrating spatial knitting attentions to embed high-level and fidelity-rich conditions in diffusion models. 2024

  64. [64]

    Flow-guided one-shot talking face generation with a high-resolution audio-visual dataset

    Zhimeng Zhang, Lincheng Li, Yu Ding, and Changjie Fan. Flow-guided one-shot talking face generation with a high-resolution audio-visual dataset. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3661–3670, 2021

  65. [65]

    MEMO: Memory-guided diffusion for expressive talking video generation

    Longtao Zheng, Yifan Zhang, Hanzhong Guo, Jiachun Pan, Zhenxiong Tan, Jiahao Lu, Chuanxin Tang, Bo An, and Shuicheng Yan. MEMO: Memory-guided diffusion for expressive talking video generation. 2024

  66. [66]

    Towards robust blind face restoration with codebook lookup transformer

    Shangchen Zhou, Kelvin Chan, Chongyi Li, and Chen Change Loy. Towards robust blind face restoration with codebook lookup transformer. Advances in Neural Information Processing Systems, 35:30599–30611, 2022

  67. [67]

    Towards robust blind face restoration with codebook lookup transformer

    Shangchen Zhou, Kelvin C.K. Chan, Chongyi Li, and Chen Change Loy. Towards robust blind face restoration with codebook lookup transformer. In NeurIPS, 2022

  68. [68]

    CelebV-HQ: A large-scale video facial attributes dataset

    Hao Zhu, Wayne Wu, Wentao Zhu, Liming Jiang, Siwei Tang, Li Zhang, Ziwei Liu, and Chen Change Loy. CelebV-HQ: A large-scale video facial attributes dataset. In European Conference on Computer Vision, pages 650–667. Springer, 2022

  69. [69]

    WildDeepfake: A challenging real-world dataset for deepfake detection

    Bojia Zi, Minghao Chang, Jingjing Chen, Xingjun Ma, and Yu-Gang Jiang. WildDeepfake: A challenging real-world dataset for deepfake detection. In 28th ACM International Conference on Multimedia, pages 2382–2390, 2020

  70. [70]

    LogisticRegression: max iterations = 1000, with all other parameters as per the defaults in the Scikit-learn LogisticRegression model

  71. [71]

    RandomForestClassifier: number of estimators = 100, with all other parameters as per the defaults in the Scikit-learn RandomForestClassifier model.

    L.3 Audio Raw Waveform Pretrained Model Configurations

    Listing 1: AASIST Full Configuration { "database_path": "./LA/", "asv_score_path": "ASVspoof2019_LA_asv...

  72. [72]

    Look down and to the right, and slowly count out loud to three

  73. [73]

    Look down and to the middle, and slowly count out loud to three

  74. [74]

    Look down and to the left, and slowly count out loud to three

  75. [75]

    Look up and to the left, and slowly count out loud to three

  76. [76]

    Look up and to the middle, and slowly count out loud to three

  77. [77]

    Look up and to the right, and slowly count out loud to three. Wave your hand back and forth across your face four times while counting out loud. Read each question aloud, followed immediately by your answer

  78. [78]

    What is my favorite food?

  79. [79]

    What is my favorite movie?

  80. [80]

    What did I have for breakfast?
