The DeepSpeak Dataset
Pith reviewed 2026-05-23 21:44 UTC · model grok-4.3
The pith
Current deepfake detectors fail to generalize to a new dataset of high-quality talking-head fakes without retraining.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Without retraining, state-of-the-art deepfake detectors fail to generalize to the DeepSpeak dataset, which contains more than 50 hours of self-recorded real data from 500 participants and more than 50 hours of state-of-the-art deepfakes created with 14 video engines, three voice engines, and identity-matched face swaps.
What carries the argument
The DeepSpeak dataset, constructed via embedding-based identity matching to produce realistic audiovisual deepfakes from multiple current synthesis engines.
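This page does not spell out the matching procedure, so the following is only a minimal sketch of what embedding-based identity matching could look like. It assumes each participant is represented by a single face-embedding vector and that a swap target is chosen as the most similar other participant under cosine similarity. The participant ids, the toy vectors, and the function names are all hypothetical, and the paper's actual protocol may differ.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def match_identities(embeddings):
    """For each identity, pick the most similar *other* identity.

    embeddings: dict mapping a participant id to a face-embedding vector.
    Returns a dict mapping each source id to its best-matching target id.
    """
    matches = {}
    for source_id, source_vec in embeddings.items():
        candidates = [
            (cosine_similarity(source_vec, target_vec), target_id)
            for target_id, target_vec in embeddings.items()
            if target_id != source_id
        ]
        # max() on (similarity, id) tuples selects the highest similarity.
        matches[source_id] = max(candidates)[1]
    return matches


# Hypothetical 3-dimensional embeddings; real face embeddings are much larger.
toy_embeddings = {
    "p001": [0.9, 0.1, 0.0],
    "p002": [0.8, 0.2, 0.1],
    "p003": [0.0, 0.1, 0.9],
}
identity_matches = match_identities(toy_embeddings)
```

Under this sketch, random pairing would correspond to ignoring the similarity score altogether, which is the contrast the page draws with identity-matched swaps.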
If this is right
- Detectors require retraining on data from the newest synthesis engines to regain effectiveness against talking-head deepfakes.
- Identity-matching protocols produce higher-quality simulated attacks than random pairing methods.
- Multimodal audiovisual coverage is required to evaluate detectors intended for video-conference and identity-verification settings.
Where Pith is reading between the lines
- Video-conference platforms could incorporate periodic retraining on datasets like this one to reduce impostor risks.
- Engine-specific failure patterns in the dataset could guide targeted improvements to detection methods.
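One way to look for engine-specific failure patterns is to group a detector's scores on fake clips by the engine that produced them. The sketch below assumes each fake clip carries an engine label and a detector score where higher means "more likely fake"; the engine names, scores, and the 0.5 threshold are illustrative assumptions, not values from the paper.

```python
from collections import defaultdict


def detection_rate_by_engine(records, threshold=0.5):
    """Fraction of fake clips flagged as fake, grouped by generating engine.

    records: iterable of (engine_name, fake_score) pairs for fake clips only,
    where a higher score means the detector considers the clip more likely fake.
    """
    flagged = defaultdict(int)
    total = defaultdict(int)
    for engine, score in records:
        total[engine] += 1
        if score >= threshold:
            flagged[engine] += 1
    return {engine: flagged[engine] / total[engine] for engine in total}


# Hypothetical engine labels and scores, for illustration only.
toy_records = [
    ("engine_a", 0.91), ("engine_a", 0.75), ("engine_a", 0.40), ("engine_a", 0.66),
    ("engine_b", 0.12), ("engine_b", 0.55), ("engine_b", 0.30), ("engine_b", 0.08),
]
rates = detection_rate_by_engine(toy_records)
# The engine with the lowest detection rate is the one the detector misses most.
weakest_engine = min(rates, key=rates.get)
```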
Load-bearing premise
Deepfakes generated with the 14 video and three voice engines plus identity matching serve as realistic stand-ins for actual adversarial attacks.
What would settle it
Measure the accuracy of the tested detectors on the full DeepSpeak test set and compare it to their reported accuracy on their original training distributions.
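The comparison described here can be sketched as follows. The sketch uses AUC rather than raw accuracy, since the simulated rebuttal mentions an average AUC drop; that choice, the function name, and all scores are assumptions made for illustration. AUC is computed directly as the probability that a randomly chosen fake clip scores higher than a randomly chosen real clip, with ties counted as one half.

```python
def roc_auc(real_scores, fake_scores):
    """AUC as the probability that a random fake outscores a random real clip.

    Scores are detector outputs where higher means "more likely fake".
    Ties contribute 0.5. This pairwise form is quadratic in the number of
    clips, which is fine for a small illustration.
    """
    wins = 0.0
    for fake in fake_scores:
        for real in real_scores:
            if fake > real:
                wins += 1.0
            elif fake == real:
                wins += 0.5
    return wins / (len(fake_scores) * len(real_scores))


# Hypothetical scores on the detector's original test distribution.
auc_original = roc_auc(real_scores=[0.1, 0.2, 0.3], fake_scores=[0.7, 0.8, 0.9])

# Hypothetical scores on a new dataset, evaluated without retraining.
auc_new = roc_auc(real_scores=[0.2, 0.6, 0.4], fake_scores=[0.5, 0.3, 0.7])

# A large positive drop would indicate a failure to generalize.
auc_drop = auc_original - auc_new
```

A drop near zero across detectors would count against the core claim, while a consistently large drop would support it.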
Original abstract
Deepfakes represent a growing concern across domains such as disinformation, fraud, and non-consensual media. In particular, the rise of video conference and identity-driven attacks in high-stakes scenarios--such as impostor hiring--demands new forensic resources. Despite significant efforts to develop robust detection classifiers to distinguish the real from the fake, commonly used training datasets remain inadequate: relying on low-quality and outdated deepfake generators, consisting of content scraped from online repositories without participant consent, lacking in multimodal coverage, and rarely employing identity-matching protocols to ensure realistic fakes. To overcome these limitations, we present the DeepSpeak dataset, a diverse and multimodal dataset comprising over 100 hours of authentic and deepfake audiovisual content, specifically focused on the challenging and diverse talking heads context. We contribute: i) more than 50 hours of real, self-recorded data collected from 500 diverse and consenting participants, ii) more than 50 hours of state-of-the-art audio and visual deepfakes generated using 14 video synthesis engines and three voice cloning engines, and iii) an embedding-based, identity-matching approach to ensure the creation of convincing, high-quality identity face swaps that realistically simulate adversarial deepfake attacks. We also perform large-scale evaluations of state-of-the-art deepfake detectors and show that, without retraining, these detectors fail to generalize to this DeepSpeak dataset, highlighting the importance of a large and diverse dataset containing deepfakes from the latest generative-AI tools.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces the DeepSpeak dataset: over 100 hours of audiovisual talking-head content, comprising real recordings from 500 consenting participants and deepfakes generated with 14 video synthesis engines, 3 voice cloning engines, and embedding-based identity matching. It reports large-scale evaluations claiming that state-of-the-art deepfake detectors fail to generalize to this dataset without retraining, underscoring the need for more diverse and modern training resources.
Significance. If the generalization experiments hold after addressing potential confounds, the dataset would provide a valuable, consent-based, multimodal benchmark focused on high-stakes talking-head scenarios and recent generators, potentially advancing detector robustness.
Major comments (2)
- [Abstract] The central claim that 'without retraining, these detectors fail to generalize' is presented without any reported metrics, list of tested models, quantitative results, or description of the evaluation protocol, preventing verification of the empirical finding.
- [Evaluation / Results] No section describes explicit controls, ablations, or matched real-video subsets to isolate the contribution of the 14 new synthesis engines from domain shift in the real class (self-recorded webcam videos with varied lighting/backgrounds versus the controlled corpora on which most detectors were trained, e.g., FaceForensics++). This attribution is load-bearing for the paper's main conclusion.
Simulated Author's Rebuttal
We thank the referee for the careful reading and constructive feedback. We address the two major comments point by point below, indicating where revisions will be made to improve clarity and rigor.
- Referee: [Abstract] The central claim that 'without retraining, these detectors fail to generalize' is presented without any reported metrics, list of tested models, quantitative results, or description of the evaluation protocol, preventing verification of the empirical finding.
  Authors: The abstract is a concise summary and therefore omits the full list of models, exact metrics, and protocol details. These are provided in the Evaluation section of the manuscript, which reports results across multiple state-of-the-art detectors, the evaluation protocol (cross-dataset testing without retraining), and quantitative outcomes demonstrating poor generalization. To improve verifiability from the abstract itself, we will add a sentence summarizing the key quantitative finding (e.g., average AUC drop) while respecting length limits.
  Revision: yes
- Referee: [Evaluation / Results] No section describes explicit controls, ablations, or matched real-video subsets to isolate the contribution of the 14 new synthesis engines from domain shift in the real class (self-recorded webcam videos with varied lighting/backgrounds versus the controlled corpora on which most detectors were trained, e.g., FaceForensics++). This attribution is load-bearing for the paper's main conclusion.
  Authors: We acknowledge that the current manuscript does not contain explicit ablations or matched real-video subsets that would fully isolate the effect of the 14 synthesis engines from potential domain shift in the real videos. The evaluation instead emphasizes end-to-end generalization failure on modern, identity-matched deepfakes generated from the new real recordings. We agree this is a substantive point and will add a dedicated ablation subsection using matched real subsets drawn from controlled corpora (e.g., FaceForensics++) to better attribute performance drops to the new generators versus real-video domain differences.
  Revision: yes
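The attribution question raised in the second comment can be illustrated with a small sketch. It assumes a fixed decision threshold and four hypothetical score subsets: real clips from a controlled corpus, self-recorded real clips, fakes from older generators, and fakes from newer engines. A rise in the false-positive rate between the two real subsets would point to domain shift in the real class, while a rise in the miss rate between the two fake subsets would point to the generators. The subset names, scores, and threshold are assumptions, and this is not the ablation design the authors commit to.

```python
def rate_above(scores, threshold):
    """Fraction of scores at or above the threshold (clips flagged as fake)."""
    return sum(score >= threshold for score in scores) / len(scores)


def attribute_errors(scores_by_subset, threshold=0.5):
    """Split detector errors into real-class and fake-class components.

    scores_by_subset: {"real": {name: scores}, "fake": {name: scores}},
    where a higher score means the detector considers the clip more likely fake.
    Returns (false_positive_rates, miss_rates), each keyed by subset name.
    """
    false_positive_rates = {
        name: rate_above(scores, threshold)
        for name, scores in scores_by_subset["real"].items()
    }
    miss_rates = {
        name: 1.0 - rate_above(scores, threshold)
        for name, scores in scores_by_subset["fake"].items()
    }
    return false_positive_rates, miss_rates


# Hypothetical scores for illustration only.
toy_scores = {
    "real": {
        "controlled_corpus": [0.1, 0.2, 0.3, 0.6],
        "self_recorded": [0.4, 0.6, 0.7, 0.2],
    },
    "fake": {
        "older_generators": [0.9, 0.8, 0.7, 0.6],
        "newer_engines": [0.3, 0.8, 0.4, 0.2],
    },
}
fpr, miss = attribute_errors(toy_scores)
```

In this toy data both error types rise, which is exactly the situation in which an end-to-end metric alone cannot say how much of the failure is due to the new engines.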
Circularity Check
No circularity: empirical dataset release and benchmark
The paper is a dataset contribution plus empirical evaluation of existing detectors on new data. No derivations, equations, fitted parameters renamed as predictions, or first-principles claims appear in the abstract or described content. The central result (detectors fail to generalize) is a direct empirical measurement, not a reduction to self-defined inputs or self-citations. The work is self-contained against external benchmarks with no load-bearing self-referential steps.
Forward citations
Cited by 5 Pith papers
- Scalable, Energy-Efficient Optical-Neural Architecture for Multiplexed Deepfake Video Detection
  Hybrid optical-digital architecture multiplexes 15+ video streams for parallel deepfake detection, reporting 97.79% average accuracy on Celeb-DF with resilience to degradation and attacks.
- The Alpha Blending Hypothesis: Compositing Shortcut in Deepfake Detection
  Deepfake detectors act as alpha blending searchers; training solely on self-blended real images yields top cross-dataset generalization on 15 datasets without using synthetic deepfakes.
- Deepfake Detection that Generalizes Across Benchmarks
  GenD achieves state-of-the-art average cross-dataset AUROC in deepfake detection by parameter-efficient adaptation of a foundational vision encoder with hyperspherical manifold enforcement via L2 normalization and met...
- Omni-Fake: Benchmarking Unified Multimodal Social Media Deepfake Detection
  Omni-Fake delivers a unified multimodal deepfake benchmark dataset and RL-driven detector that reports gains in accuracy, cross-modal generalization, and explainability over prior baselines.
- EMO-BOOST: Emotion-Augmented Audio-Visual Features for Improved Generalization in Deepfake Detection
  Emo-Boost augments low-level deepfake detectors with intra- and inter-modal emotion consistency checks to raise cross-manipulation generalization AUC by 2.1% on FakeAVCeleb.