Eroding Trust in Real Speech: A Large-Scale Study of Human Audio Deepfake Perception

Nicolas M. M\"uller; Wei Herng Choong

arxiv: 2605.26136 · v1 · pith:A7OFSNN2new · submitted 2026-05-21 · 💻 cs.SD · cs.AI

Eroding Trust in Real Speech: A Large-Scale Study of Human Audio Deepfake Perception

Nicolas M. M\"uller , Wei Herng Choong This is my paper

Pith reviewed 2026-06-30 15:37 UTC · model grok-4.3

classification 💻 cs.SD cs.AI

keywords audio deepfakeshuman perceptiontrust erosionspeech synthesisdeepfake detectionlistening studyvoice conversionreal vs synthetic speech

0 comments

The pith

Human accuracy at recognizing real speech fell from 72.7% to 64.1% while accuracy on fakes stayed nearly flat at 71%.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents results from the largest listening study of audio deepfakes to date, with 35,532 judgments from 1,768 participants on samples from 138 synthesis systems. It compares these outcomes to a 2021 baseline and finds that listeners have not become better at spotting fakes but have instead become more likely to reject genuine speech as fake. The authors conclude that the main danger from improving deepfakes is not successful deception but growing distrust of authentic audio. This pattern held across system types, though commercial and language-model-based systems were the hardest for people to classify correctly. An automated detector used for reference kept accuracy above 94% in all conditions.

Core claim

The study establishes a skepticism shift in which accuracy on real speech samples declined from 72.7% in 2021 to 64.1% today while accuracy on fake samples remained nearly constant around 72%. Listeners are not failing to notice synthesis artifacts more often; they are increasingly labeling authentic speech as synthetic. Samples from commercial and autoregressive language-model systems were the most difficult for humans to classify correctly.

What carries the argument

A large-scale listening test that directly compares current participant judgments on real and synthetic speech against a 2021 baseline across 138 text-to-speech and voice-conversion systems.

If this is right

Voice recordings used as evidence in legal settings would face greater scrutiny even when authentic.
Systems that rely on voice for authentication would encounter more false rejections from cautious users.
Detection research would need to shift emphasis toward confirming real speech rather than only identifying fakes.
Newer commercial and autoregressive synthesis methods would require targeted improvements to match human perception patterns.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Public awareness campaigns about deepfake prevalence might need to include calibration exercises so listeners do not over-correct by doubting everything.
Similar trust erosion could appear in other modalities such as video or text if exposure to high-quality fakes continues to rise.
Longitudinal studies that track the same individuals over time could separate individual learning effects from population-level shifts in skepticism.

Load-bearing premise

The 2021 baseline study used comparable participant pools, stimuli, and task designs so the drop in real-speech accuracy can be attributed to increased exposure to deepfakes rather than study differences.

What would settle it

A replication that uses the exact same real-speech samples, participant instructions, and demographic matching as the 2021 baseline but collects new judgments today and still finds the same real-speech accuracy as before would falsify the erosion claim.

Figures

Figures reproduced from arXiv: 2605.26136 by Nicolas M. M\"uller, Wei Herng Choong.

**Figure 2.** Figure 2: Demographic analysis of the 2026 study. (a) Accuracy by age bracket (18–49); each dot is one participant, horizontal [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗

**Figure 3.** Figure 3: Human (colored) and ML detector (grey) accuracy [PITH_FULL_IMAGE:figures/full_fig_p003_3.png] view at source ↗

**Figure 4.** Figure 4: Learning effect: accuracy by round number (blue, [PITH_FULL_IMAGE:figures/full_fig_p004_4.png] view at source ↗

**Figure 5.** Figure 5: Human accuracy for each individual TTS/VC system (fake samples only, minimum 10 judgments), sorted by descending [PITH_FULL_IMAGE:figures/full_fig_p005_5.png] view at source ↗

read the original abstract

Audio deepfakes have improved rapidly recently, yet their effect on human trust in real speech remains unstudied. We present the largest listening study on audio deepfake perception to date, collecting 35,532 judgments from 1,768 participants across 138 text-to-speech and voice conversion systems. Our central finding is a skepticism shift: compared to a 2021 baseline, human accuracy on fake samples barely changed (72.9% to 71.2%), but accuracy on real samples dropped from 72.7% to 64.1%. Participants are not worse at detecting synthesis artifacts; rather, they increasingly distrust authentic speech. Samples generated by commercial and autoregressive language model systems proved hardest to detect (61.3 - 65.9%), while those from traditional seq2seq and flow-matching models remain easier to spot (75.4 - 76.8%). An ML detector that served as a reference point maintained over 94.5% accuracy across all conditions. Our results suggest that the primary threat posed by modern deepfakes may not be mere deception, but the erosion of trust in genuine audio.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper finds a drop in accuracy on real audio from 72.7% to 64.1% versus a 2021 baseline but supplies no evidence the two studies are comparable on participants or stimuli.

read the letter

The main point here is the reported skepticism shift: fake detection held steady around 71-73% while real-speech accuracy fell to 64.1%. That numerical contrast comes from 35,532 judgments across 1,768 listeners and 138 systems, which is a solid scale for this kind of perception work.

What stands out is the size of the data collection and the inclusion of an ML detector that stayed above 94% across conditions. The breakdown by model type (commercial and autoregressive systems harder to spot than seq2seq or flow-matching) also gives concrete numbers that future studies can use. The paper sticks to direct reporting of human judgments without fitting extra parameters.

The soft spot is the baseline comparison. The drop is presented as evidence of greater deepfake exposure, yet the abstract gives no details on whether the 2021 study used matching listener pools, the same number or acoustic range of real samples, or identical task instructions. Without those checks, the attribution to skepticism rather than design differences stays open. The current study describes its own methods in more detail than the reference, which leaves the central claim resting on an unverified equivalence.

This is useful for anyone tracking audio forensics or media trust questions. It raises a practical issue worth testing further. I would send it to peer review so the methods section can be examined directly and the baseline match can be clarified or adjusted.

Referee Report

1 major / 2 minor

Summary. The manuscript presents a large-scale empirical study on human perception of audio deepfakes, collecting 35,532 judgments from 1,768 participants across 138 synthesis systems. The key finding is a 'skepticism shift': while detection accuracy on fake audio remained roughly stable compared to a 2021 baseline (72.9% to 71.2%), accuracy on real audio declined from 72.7% to 64.1%. The authors interpret this as evidence that the primary threat of deepfakes is eroding trust in genuine speech rather than improving deception. Additional results compare detection difficulty across synthesis architectures and benchmark against an ML detector achieving >94.5% accuracy.

Significance. If the comparison to the 2021 baseline is methodologically sound, this work provides important evidence that exposure to deepfakes may be causing listeners to become more skeptical of authentic audio. The scale of the study (over 1,700 participants) and coverage of modern commercial and autoregressive systems strengthen the empirical contribution. The finding shifts focus from detection to trust erosion, which has implications for audio forensics, media literacy, and deployment of generative audio technologies.

major comments (1)

[Abstract and Methods (baseline comparison)] The central attribution of the accuracy drop on real samples (72.7% → 64.1%) to increased deepfake exposure requires that the 2021 baseline study be comparable in participant demographics, stimulus selection (number and acoustic variety of real samples), task design, and presentation conditions. The current study is described in detail, but the manuscript does not provide explicit equivalence checks or matching criteria for the baseline. Without this, alternative explanations based on methodological differences cannot be ruled out, undermining the skepticism-shift conclusion.

minor comments (2)

[Results (system comparisons)] The ranges given for commercial/autoregressive (61.3-65.9%) and traditional models (75.4-76.8%) would benefit from per-system breakdowns or confidence intervals to assess variability.
[Abstract] Clarify whether the 2021 baseline used the same response format (e.g., binary real/fake judgment) and if any adjustments were made for multiple comparisons in the statistical analysis.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for highlighting the importance of methodological comparability in our baseline comparison, which underpins the skepticism-shift interpretation. We address this point directly below and commit to revisions that strengthen the manuscript.

read point-by-point responses

Referee: [Abstract and Methods (baseline comparison)] The central attribution of the accuracy drop on real samples (72.7% → 64.1%) to increased deepfake exposure requires that the 2021 baseline study be comparable in participant demographics, stimulus selection (number and acoustic variety of real samples), task design, and presentation conditions. The current study is described in detail, but the manuscript does not provide explicit equivalence checks or matching criteria for the baseline. Without this, alternative explanations based on methodological differences cannot be ruled out, undermining the skepticism-shift conclusion.

Authors: We agree that explicit equivalence checks are necessary to support causal attribution to deepfake exposure. The 2021 baseline is drawn from a cited prior study; our original manuscript summarized its key parameters but did not include a side-by-side methodological comparison. In revision we will add a dedicated subsection (likely in Methods or a new Appendix) that tabulates and discusses comparability on the four dimensions raised: (1) participant demographics (age, gender, location, prior exposure to deepfakes where reported), (2) stimulus selection (number of real samples, speaker diversity, acoustic conditions), (3) task design (binary real/fake judgment, instructions, number of trials per participant), and (4) presentation conditions (audio format, duration, platform, volume normalization). Where the baseline paper supplies the requisite details we will report quantitative matches or differences; where data are unavailable we will note the limitation and qualify the interpretation of the accuracy drop. This addition will allow readers to evaluate the strength of the skepticism-shift claim directly. revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical reporting of human judgments with external baseline comparison

full rationale

The paper presents results from a large-scale listening study (35,532 judgments) and directly reports observed accuracy rates on real and fake audio samples. The central claim of a 'skepticism shift' is a summary statistic derived from new participant data compared against a cited 2021 baseline study. No equations, fitted parameters, predictions, ansatzes, or derivations appear in the provided text. The comparison to the baseline is an interpretive step resting on methodological equivalence assumptions, but this is an external validity issue rather than any reduction of the result to its own inputs by construction. No self-citations are load-bearing, and the study is self-contained as an empirical report against an independent prior dataset.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The study is observational and therefore rests on standard assumptions about human-subject data collection and cross-study comparability rather than mathematical axioms or new entities.

axioms (1)

domain assumption The 2021 baseline provides a valid counterfactual for measuring change in human perception.
The skepticism-shift interpretation depends on treating the earlier study as a matched control.

pith-pipeline@v0.9.1-grok · 5736 in / 1282 out tokens · 45831 ms · 2026-06-30T15:37:52.678704+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

38 extracted references · 9 canonical work pages · 2 internal anchors

[1]

Alexei Baevski, Yuhao Zhou, Abdelrahman Mohamed, and Michael Auli. 2020. wav2vec 2.0: A framework for self-supervised learning of speech representations. Advances in neural information processing systems33 (2020), 12449–12460

2020
[2]

Sanyuan Chen, Shujie Liu, Long Zhou, Yanqing Liu, Xu Tan, Jinyu Li, Sheng Zhao, Yao Qian, and Furu Wei. 2024. VALL-E 2: Neural Codec Language Mod- els are Human Parity Zero-Shot Text to Speech Synthesizers.arXiv preprint arXiv:2406.05370(2024)

work page arXiv 2024
[3]

Yushen Chen, Zhikang Niu, Ziyang Ma, Keqi Deng, Chunhui Wang, Jian Zhao, et al. 2025. F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching. InProc. ACL. 6255–6271

2025
[4]

Robert Chesney and Danielle Citron. 2019. Deep Fakes: A Looming Challenge for Privacy, Democracy, and National Security.California Law Review107 (2019), 1753–1820

2019
[5]

Di Cooke, Abigail Edwards, Sophia Barkoff, and Kathryn Kelly. 2025. As Good as a Coin Toss: Human Detection of AI-Generated Content.Commun. ACM68, 10 (2025)

2025
[6]

Deloitte Center for Financial Services. 2024. Generative AI Is Ex- pected to Magnify the Risk of Deepfakes and Other Fraud in Bank- ing. https://www2.deloitte.com/us/en/insights/industry/financial- services/financial-services-industry-predictions/2024/deepfake-banking- fraud-risk-on-the-rise.html

2024
[7]

Schröter, Karl F

Alexander Diel, Tania Lalgi, Isabel C. Schröter, Karl F. MacDorman, Martin Teufel, and Alexander Bäuerle. 2024. Human Performance in Detecting Deepfakes: A Systematic Review and Meta-Analysis of 56 Papers.Computers in Human Behavior Reports16 (2024). doi:10.1016/j.chbr.2024.100499

work page doi:10.1016/j.chbr.2024.100499 2024
[8]

Zhihao Du, Qian Chen, et al. 2024. CosyVoice: A Scalable Multilingual Zero-shot Text-to-speech Synthesizer Based on Supervised Semantic Tokens.arXiv preprint arXiv:2407.05407(2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024
[9]

ElevenLabs. 2024. ElevenLabs Text to Speech API. https://elevenlabs.io

2024
[10]

FBI San Francisco. 2026. FBI San Francisco Warns Romance Scams Increasing Across the Bay Area This Valentine’s Day. https://www.fbi.gov/contact-us/field- offices/sanfrancisco/fbi-san-francisco-warns-romance-scams-increasing- across-the-bay-area-this-valentines-day

2026
[11]

Federal Communications Commission. 2024. Proposed $6 Million Fine Against Political Consultant Who Used AI-Generated Deepfake Robocalls. https://docs. fcc.gov/public/attachments/DOC-402762A1.pdf

2024
[12]

Daniel Gover. 2024. Finance worker pays out $25 million after video call with deepfake ‘chief financial officer’.CNN(Feb 2024)

2024
[13]

Matthew Groh, Ziv Epstein, Chaz Firestone, and Rosalind Picard. 2022. Deep- fake Detection by Human Crowds, Machines, and Machine-Informed Crowds. Proceedings of the National Academy of Sciences119, 1 (2022)

2022
[14]

Keith Ito and Linda Johnson. 2017. The LJ Speech Dataset. https://keithito.com/LJ- Speech-Dataset/

2017
[15]

Jaehyeon Kim, Jungil Kong, and Juhee Son. 2021. Conditional Variational Au- toencoder with Adversarial Learning for End-to-End Text-to-Speech. InProc. ICML. 5530–5540

2021
[16]

Raghavan, Gavin Mischler, and Nima Mesgarani

Yinghao Aaron Li, Cong Han, Vinay S. Raghavan, Gavin Mischler, and Nima Mesgarani. 2023. StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models. InProc. NeurIPS

2023
[17]

Xuechen Liu, Xin Wang, Md Sahidullah, Jose Patino, Héctor Delgado, Tomi Kinnunen, Massimiliano Todisco, Junichi Yamagishi, Nicholas Evans, Andreas Nautsch, and Kong Aik Lee. 2023. ASVspoof 2021: Towards Spoofed and Deepfake Speech Detection in the Wild.IEEE/ACM Transactions on Audio, Speech and Language Processing(2023)

2023
[18]

Bray, Toby O

Khai Tinh Mai, Sergi D. Bray, Toby O. Davies, and Lewis D. Griffin. 2023. Warning: Humans Cannot Reliably Detect Speech Deepfakes.PLOS ONE(2023)

2023
[19]

McAfee. 2023. Beware the Artificial Impostor: A McAfee Study on the Rise of AI Scams. https://www.mcafee.com/learn/a-guide-to-deepfake-scams-and-ai- voice-spoofing/

2023
[20]

Müller, Pavel Czempin, Franziska Diekmann, Adam Froghyar, and Konstantin Böttinger

Nicolas M. Müller, Pavel Czempin, Franziska Diekmann, Adam Froghyar, and Konstantin Böttinger. 2022. Does Audio Deepfake Detection Generalize?. InProc. Interspeech. 2783–2787

2022
[21]

Müller, Piotr Kawa, Wei Herng Choong, et al

Nicolas M. Müller, Piotr Kawa, Wei Herng Choong, et al. 2024. MLAAD: The Multi-Language Audio Anti-Spoofing Dataset. InProc. IJCNN. doi:10.1109/ IJCNN60899.2024.10650962

work page arXiv 2024
[22]

Müller, Karla Pizzi, and Jennifer Williams

Nicolas M. Müller, Karla Pizzi, and Jennifer Williams. 2022. Human Perception of Audio Deepfakes. InProc. 1st International Workshop on Deepfake Detection for Audio Multimedia (DDAM). 85–91. doi:10.1145/3552466.3556531

work page doi:10.1145/3552466.3556531 2022
[23]

Vadim Popov, Ivan Vovk, Vladimir Gogoryan, Tasnima Sadekova, and Mikhail Kudinov. 2021. Grad-TTS: A Diffusion Probabilistic Model for Text-to-Speech. InProc. ICML

2021
[24]

Zengyi Qin, Wenliang Zhao, Xumin Yu, and Xin Sun. 2024. OpenVoice: Versatile Instant Voice Cloning.arXiv preprint arXiv:2312.01479(2024)

work page arXiv 2024
[25]

Resemble AI. 2024. Resemble AI Speech Synthesis API. https://www.resemble.ai

2024
[26]

Resemble AI. 2025. Chatterbox TTS. https://github.com/resemble-ai/chatterbox

2025
[27]

Eugenia San Segundo, Aurora López-Jareño, Xin Wang, and Junichi Yamagishi
[28]

Human Perception of Audio Deepfakes: The Role of Language and Speaking Style.arXiv preprint arXiv:2512.09221(2025)

work page arXiv 2025
[29]

Weiss, Mike Schuster, Navdeep Jaitly, Zongheng Yang, et al

Jonathan Shen, Ruoming Pang, Ron J. Weiss, Mike Schuster, Navdeep Jaitly, Zongheng Yang, et al. 2018. Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions. InProc. ICASSP. 4779–4783

2018
[30]

Catherine Stupp. 2019. Fraudsters Used AI to Mimic CEO’s Voice in Unusual Cybercrime Case.The Wall Street Journal(Aug 2019)

2019
[31]

Suno AI. 2023. Bark: Text-to-Audio Model. https://github.com/suno-ai/bark

2023
[32]

Hemlata Tak, Jose Patino, Massimiliano Todisco, Andreas Nautsch, Nicholas Evans, and Anthony Larcher. 2021. End-to-End Anti-Spoofing with RawNet2. In Proc. ICASSP. 6369–6373

2021
[33]

Hemlata Tak, Massimiliano Todisco, Xin Wang, Jee weon Jung, Junichi Yamagishi, and Nicholas Evans. 2022. Automatic Speaker Verification Spoofing and Deepfake Detection Using Wav2Vec 2.0 and Data Augmentation. InProc. Speaker Odyssey

2022
[34]

Chengyi Wang, Sanyuan Chen, Yu Wu, Ziqiang Zhang, Long Zhou, Shujie Liu, et al. 2023. VALL-E: Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers.arXiv preprint arXiv:2301.02111(2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023
[35]

Xin Wang, Héctor Delgado, Hemlata Tak, Jee weon Jung, et al. 2025. ASVspoof 5: Design, Collection and Validation of Resources for Spoofing, Deepfake, and Adversarial Attack Detection Using Crowdsourced Speech.Computer Speech & Language(2025)

2025
[36]

Xin Wang, Junichi Yamagishi, Massimiliano Todisco, Héctor Delgado, Andreas Nautsch, Nicholas Evans, et al . 2020. ASVspoof 2019: A Large-Scale Public Database of Synthesized, Converted and Replayed Speech.Computer Speech & Language64 (2020), 101114

2020
[37]

Kevin Warren, Tyler Tucker, Anna Crowder, Daniel Olszewski, Allison Lu, Car- oline Fedele, Magdalena Pasternak, Seth Layton, Kevin Butler, Carrie Gates, and Patrick Traynor. 2024. Better Be Computer or I’m Dumb: A Large- Scale Evaluation of Humans as Audio Deepfake Detectors. InProc. ACM CCS. doi:10.1145/3658644.3670325

work page doi:10.1145/3658644.3670325 2024
[38]

Jee weon Jung, Hee-Soo Heo, Hemlata Tak, Hye jin Shim, Joon Son Chung, Bong- Jin Lee, Ha-Jin Yu, and Nicholas Evans. 2022. AASIST: Audio Anti-Spoofing Using Integrated Spectro-Temporal Graph Attention Networks. InProc. ICASSP. 6367–6371. 6

2022

[1] [1]

Alexei Baevski, Yuhao Zhou, Abdelrahman Mohamed, and Michael Auli. 2020. wav2vec 2.0: A framework for self-supervised learning of speech representations. Advances in neural information processing systems33 (2020), 12449–12460

2020

[2] [2]

Sanyuan Chen, Shujie Liu, Long Zhou, Yanqing Liu, Xu Tan, Jinyu Li, Sheng Zhao, Yao Qian, and Furu Wei. 2024. VALL-E 2: Neural Codec Language Mod- els are Human Parity Zero-Shot Text to Speech Synthesizers.arXiv preprint arXiv:2406.05370(2024)

work page arXiv 2024

[3] [3]

Yushen Chen, Zhikang Niu, Ziyang Ma, Keqi Deng, Chunhui Wang, Jian Zhao, et al. 2025. F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching. InProc. ACL. 6255–6271

2025

[4] [4]

Robert Chesney and Danielle Citron. 2019. Deep Fakes: A Looming Challenge for Privacy, Democracy, and National Security.California Law Review107 (2019), 1753–1820

2019

[5] [5]

Di Cooke, Abigail Edwards, Sophia Barkoff, and Kathryn Kelly. 2025. As Good as a Coin Toss: Human Detection of AI-Generated Content.Commun. ACM68, 10 (2025)

2025

[6] [6]

Deloitte Center for Financial Services. 2024. Generative AI Is Ex- pected to Magnify the Risk of Deepfakes and Other Fraud in Bank- ing. https://www2.deloitte.com/us/en/insights/industry/financial- services/financial-services-industry-predictions/2024/deepfake-banking- fraud-risk-on-the-rise.html

2024

[7] [7]

Schröter, Karl F

Alexander Diel, Tania Lalgi, Isabel C. Schröter, Karl F. MacDorman, Martin Teufel, and Alexander Bäuerle. 2024. Human Performance in Detecting Deepfakes: A Systematic Review and Meta-Analysis of 56 Papers.Computers in Human Behavior Reports16 (2024). doi:10.1016/j.chbr.2024.100499

work page doi:10.1016/j.chbr.2024.100499 2024

[8] [8]

Zhihao Du, Qian Chen, et al. 2024. CosyVoice: A Scalable Multilingual Zero-shot Text-to-speech Synthesizer Based on Supervised Semantic Tokens.arXiv preprint arXiv:2407.05407(2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024

[9] [9]

ElevenLabs. 2024. ElevenLabs Text to Speech API. https://elevenlabs.io

2024

[10] [10]

FBI San Francisco. 2026. FBI San Francisco Warns Romance Scams Increasing Across the Bay Area This Valentine’s Day. https://www.fbi.gov/contact-us/field- offices/sanfrancisco/fbi-san-francisco-warns-romance-scams-increasing- across-the-bay-area-this-valentines-day

2026

[11] [11]

Federal Communications Commission. 2024. Proposed $6 Million Fine Against Political Consultant Who Used AI-Generated Deepfake Robocalls. https://docs. fcc.gov/public/attachments/DOC-402762A1.pdf

2024

[12] [12]

Daniel Gover. 2024. Finance worker pays out $25 million after video call with deepfake ‘chief financial officer’.CNN(Feb 2024)

2024

[13] [13]

Matthew Groh, Ziv Epstein, Chaz Firestone, and Rosalind Picard. 2022. Deep- fake Detection by Human Crowds, Machines, and Machine-Informed Crowds. Proceedings of the National Academy of Sciences119, 1 (2022)

2022

[14] [14]

Keith Ito and Linda Johnson. 2017. The LJ Speech Dataset. https://keithito.com/LJ- Speech-Dataset/

2017

[15] [15]

Jaehyeon Kim, Jungil Kong, and Juhee Son. 2021. Conditional Variational Au- toencoder with Adversarial Learning for End-to-End Text-to-Speech. InProc. ICML. 5530–5540

2021

[16] [16]

Raghavan, Gavin Mischler, and Nima Mesgarani

Yinghao Aaron Li, Cong Han, Vinay S. Raghavan, Gavin Mischler, and Nima Mesgarani. 2023. StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models. InProc. NeurIPS

2023

[17] [17]

Xuechen Liu, Xin Wang, Md Sahidullah, Jose Patino, Héctor Delgado, Tomi Kinnunen, Massimiliano Todisco, Junichi Yamagishi, Nicholas Evans, Andreas Nautsch, and Kong Aik Lee. 2023. ASVspoof 2021: Towards Spoofed and Deepfake Speech Detection in the Wild.IEEE/ACM Transactions on Audio, Speech and Language Processing(2023)

2023

[18] [18]

Bray, Toby O

Khai Tinh Mai, Sergi D. Bray, Toby O. Davies, and Lewis D. Griffin. 2023. Warning: Humans Cannot Reliably Detect Speech Deepfakes.PLOS ONE(2023)

2023

[19] [19]

McAfee. 2023. Beware the Artificial Impostor: A McAfee Study on the Rise of AI Scams. https://www.mcafee.com/learn/a-guide-to-deepfake-scams-and-ai- voice-spoofing/

2023

[20] [20]

Müller, Pavel Czempin, Franziska Diekmann, Adam Froghyar, and Konstantin Böttinger

Nicolas M. Müller, Pavel Czempin, Franziska Diekmann, Adam Froghyar, and Konstantin Böttinger. 2022. Does Audio Deepfake Detection Generalize?. InProc. Interspeech. 2783–2787

2022

[21] [21]

Müller, Piotr Kawa, Wei Herng Choong, et al

Nicolas M. Müller, Piotr Kawa, Wei Herng Choong, et al. 2024. MLAAD: The Multi-Language Audio Anti-Spoofing Dataset. InProc. IJCNN. doi:10.1109/ IJCNN60899.2024.10650962

work page arXiv 2024

[22] [22]

Müller, Karla Pizzi, and Jennifer Williams

Nicolas M. Müller, Karla Pizzi, and Jennifer Williams. 2022. Human Perception of Audio Deepfakes. InProc. 1st International Workshop on Deepfake Detection for Audio Multimedia (DDAM). 85–91. doi:10.1145/3552466.3556531

work page doi:10.1145/3552466.3556531 2022

[23] [23]

Vadim Popov, Ivan Vovk, Vladimir Gogoryan, Tasnima Sadekova, and Mikhail Kudinov. 2021. Grad-TTS: A Diffusion Probabilistic Model for Text-to-Speech. InProc. ICML

2021

[24] [24]

Zengyi Qin, Wenliang Zhao, Xumin Yu, and Xin Sun. 2024. OpenVoice: Versatile Instant Voice Cloning.arXiv preprint arXiv:2312.01479(2024)

work page arXiv 2024

[25] [25]

Resemble AI. 2024. Resemble AI Speech Synthesis API. https://www.resemble.ai

2024

[26] [26]

Resemble AI. 2025. Chatterbox TTS. https://github.com/resemble-ai/chatterbox

2025

[27] [27]

Eugenia San Segundo, Aurora López-Jareño, Xin Wang, and Junichi Yamagishi

[28] [28]

Human Perception of Audio Deepfakes: The Role of Language and Speaking Style.arXiv preprint arXiv:2512.09221(2025)

work page arXiv 2025

[29] [29]

Weiss, Mike Schuster, Navdeep Jaitly, Zongheng Yang, et al

Jonathan Shen, Ruoming Pang, Ron J. Weiss, Mike Schuster, Navdeep Jaitly, Zongheng Yang, et al. 2018. Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions. InProc. ICASSP. 4779–4783

2018

[30] [30]

Catherine Stupp. 2019. Fraudsters Used AI to Mimic CEO’s Voice in Unusual Cybercrime Case.The Wall Street Journal(Aug 2019)

2019

[31] [31]

Suno AI. 2023. Bark: Text-to-Audio Model. https://github.com/suno-ai/bark

2023

[32] [32]

Hemlata Tak, Jose Patino, Massimiliano Todisco, Andreas Nautsch, Nicholas Evans, and Anthony Larcher. 2021. End-to-End Anti-Spoofing with RawNet2. In Proc. ICASSP. 6369–6373

2021

[33] [33]

Hemlata Tak, Massimiliano Todisco, Xin Wang, Jee weon Jung, Junichi Yamagishi, and Nicholas Evans. 2022. Automatic Speaker Verification Spoofing and Deepfake Detection Using Wav2Vec 2.0 and Data Augmentation. InProc. Speaker Odyssey

2022

[34] [34]

Chengyi Wang, Sanyuan Chen, Yu Wu, Ziqiang Zhang, Long Zhou, Shujie Liu, et al. 2023. VALL-E: Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers.arXiv preprint arXiv:2301.02111(2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023

[35] [35]

Xin Wang, Héctor Delgado, Hemlata Tak, Jee weon Jung, et al. 2025. ASVspoof 5: Design, Collection and Validation of Resources for Spoofing, Deepfake, and Adversarial Attack Detection Using Crowdsourced Speech.Computer Speech & Language(2025)

2025

[36] [36]

Xin Wang, Junichi Yamagishi, Massimiliano Todisco, Héctor Delgado, Andreas Nautsch, Nicholas Evans, et al . 2020. ASVspoof 2019: A Large-Scale Public Database of Synthesized, Converted and Replayed Speech.Computer Speech & Language64 (2020), 101114

2020

[37] [37]

Kevin Warren, Tyler Tucker, Anna Crowder, Daniel Olszewski, Allison Lu, Car- oline Fedele, Magdalena Pasternak, Seth Layton, Kevin Butler, Carrie Gates, and Patrick Traynor. 2024. Better Be Computer or I’m Dumb: A Large- Scale Evaluation of Humans as Audio Deepfake Detectors. InProc. ACM CCS. doi:10.1145/3658644.3670325

work page doi:10.1145/3658644.3670325 2024

[38] [38]

Jee weon Jung, Hee-Soo Heo, Hemlata Tak, Hye jin Shim, Joon Son Chung, Bong- Jin Lee, Ha-Jin Yu, and Nicholas Evans. 2022. AASIST: Audio Anti-Spoofing Using Integrated Spectro-Temporal Graph Attention Networks. InProc. ICASSP. 6367–6371. 6

2022