
arxiv: 2401.09512 · v10 · pith:W3ZDYWZBnew · submitted 2024-01-17 · 💻 cs.SD · eess.AS

MLAAD: The Multi-Language Audio Anti-Spoofing Dataset

Pith reviewed 2026-05-24 04:32 UTC · model grok-4.3

classification 💻 cs.SD eess.AS
keywords audio deepfake detection · anti-spoofing dataset · text-to-speech synthesis · multi-language audio · synthetic speech · deepfake detection models · ASVspoof comparison

The pith

MLAAD, built from 175 TTS systems across 54 languages, trains audio deepfake detectors that outperform those trained on prior datasets and complement ASVspoof 2019.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces MLAAD as a large collection of synthetic audio for training and evaluating deepfake detectors. It draws on 175 distinct text-to-speech models to produce 1002.9 hours of voice data spanning 54 languages. Detectors trained on this resource show stronger results than those trained on smaller or less varied sets such as InTheWild and FakeOrReal. Across eight test datasets, MLAAD and the established ASVspoof 2019 dataset each lead on four of them.

Core claim

MLAAD supplies 1002.9 hours of synthetic audio generated by 175 TTS models in 54 languages. Training three state-of-the-art detectors on MLAAD produces better performance than training on InTheWild or FakeOrReal. In head-to-head tests across eight evaluation sets, MLAAD and ASVspoof 2019 each outperform the other on exactly four datasets. The authors release both the full dataset and a trained model through a public webserver.

What carries the argument

The MLAAD dataset itself, assembled from the outputs of 175 separate TTS models.

If this is right

  • Detectors trained on MLAAD will identify a broader set of synthetic voices than detectors trained on narrower existing collections.
  • Pairing MLAAD with ASVspoof 2019 supplies coverage that neither resource achieves alone.
  • The 54-language scope supports training detectors that remain effective outside English-dominant test conditions.
  • Public release of the dataset and webserver model allows non-specialists to build and test anti-spoofing tools directly.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Future TTS systems that differ substantially from the 175 models used here could still evade detectors trained on MLAAD.
  • The same multi-source generation strategy might be applied to other modalities such as video to create similarly diverse training sets.
  • Performance differences across the 54 languages could guide targeted collection of additional data for under-performing languages.

Load-bearing premise

The synthetic voices produced by these 175 TTS models capture the range of features found in real-world deepfake audio.

What would settle it

A detection model trained only on MLAAD shows markedly lower accuracy on deepfakes generated by a new TTS system that was never used to create any of the dataset's samples.
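
One way to set up such a test is a held-out-system split, where every sample from one TTS system is kept out of training and used only for evaluation. The sketch below is illustrative only: the manifest layout (pairs of file path and system name), the function name, and the sample entries are assumptions, not the dataset's actual format or the authors' protocol.

```python
def split_by_held_out_system(samples, held_out_system):
    """Split (path, tts_system) pairs so one TTS system never appears in training.

    The (path, tts_system) layout is a hypothetical manifest format.
    """
    train, test = [], []
    for path, system in samples:
        if system == held_out_system:
            test.append((path, system))   # unseen system: evaluation only
        else:
            train.append((path, system))  # all other systems: training
    return train, test


# Made-up manifest entries, for illustration only.
samples = [
    ("a.wav", "system_A"),
    ("b.wav", "system_B"),
    ("c.wav", "system_A"),
    ("d.wav", "system_C"),
]
train, test = split_by_held_out_system(samples, "system_A")
```

Repeating the split with each system held out in turn would show whether the accuracy drop is specific to particular synthesis architectures.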

Figures

Figures reproduced from arXiv: 2401.09512 by Edresson Casanova, Eren Gölge, Konstantin Böttinger, Nicolas M. Müller, Philip Sperl, Piotr Kawa, Piotr Syga, Thorsten Müller, Wei Herng Choong.

Figure 1: Visualization of the data creation process for MLAAD.
Figure 2: Composition of MLAAD with respect to TTS model architecture (top), training dataset (middle) and language distribution
Original abstract

This paper presents the Multi-Language Audio Anti-Spoofing Dataset (MLAAD), version 10: a dataset of synthetic audio to train and evaluate audio deepfake detection models. It features 175 Text-to-Speech (TTS) models, comprising a total of 1002.9 hours of synthetic voice in 54 different languages. To evaluate this dataset, we train three state-of-the-art deepfake detection models with MLAAD and observe that it demonstrates superior performance to comparable datasets like InTheWild and FakeOrReal when used as a training resource. Moreover, compared to the renowned ASVspoof 2019 dataset, MLAAD proves to be a complementary resource. In tests across eight datasets, MLAAD and ASVspoof 2019 alternately outperformed each other, each excelling on four datasets. By publishing the dataset and making a trained model accessible via an interactive webserver, we aim to democratize anti-spoofing technology, making it accessible beyond the realm of specialists, and contributing to global efforts against audio spoofing and deepfakes.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The manuscript presents the Multi-Language Audio Anti-Spoofing Dataset (MLAAD) version 10, comprising 1002.9 hours of synthetic audio from 175 TTS models across 54 languages. It claims that models trained on MLAAD achieve superior performance to those trained on InTheWild and FakeOrReal, and that MLAAD is complementary to ASVspoof 2019, with each resource outperforming the other on four of eight external test datasets.

Significance. Release of a large-scale multi-lingual synthetic audio corpus with 175 distinct TTS systems could provide a useful training resource for audio deepfake detection, particularly if the complementarity result with ASVspoof 2019 holds under controlled conditions. The accompanying webserver model further supports accessibility. The empirical superiority claims, however, rest on comparisons whose validity depends on addressing data-volume confounds and providing missing experimental details.

major comments (3)
  1. [Abstract] The reported superiority of MLAAD over InTheWild and FakeOrReal as a training resource does not indicate whether training-set sizes (in hours or utterances) were matched across conditions. MLAAD's 1002.9 h is an order of magnitude larger than typical comparables; without subsampling or explicit volume controls, performance differences cannot be attributed to the 175 TTS models or 54-language coverage rather than data quantity.
  2. [Abstract (evaluation paragraph)] The central empirical claims lack any description of training hyperparameters, exact test-set construction protocols, or statistical significance testing, leaving the performance gains and the four/four alternation result only partially supported by the reported evidence.
  3. [Abstract] The complementarity claim with ASVspoof 2019 across eight datasets requires explicit enumeration of those eight test collections, the precise evaluation metrics, and confirmation that identical model architectures and training schedules were used in all head-to-head comparisons.
minor comments (1)
  1. The manuscript should clarify the rationale for version numbering (v10) and any changes relative to earlier releases of the same corpus.
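
The volume control requested in major comment 1 could be approximated by drawing a random subsample of the larger corpus whose total duration matches the smaller one. The following is a minimal sketch under stated assumptions: the (utterance_id, duration_seconds) layout, the function name, and the toy durations are hypothetical, and plain random sampling does not preserve the balance across TTS systems or languages, so a stratified variant may be preferable in practice.

```python
import random


def subsample_to_hours(utterances, target_hours, seed=0):
    """Randomly pick utterances until total duration is close to target_hours.

    utterances: list of (utterance_id, duration_seconds) pairs (assumed layout).
    Returns the picked utterances and their total duration in hours.
    """
    rng = random.Random(seed)
    shuffled = list(utterances)
    rng.shuffle(shuffled)
    budget = target_hours * 3600.0
    picked, total = [], 0.0
    for utt_id, dur in shuffled:
        if total + dur > budget:
            continue  # skip clips that would overshoot the duration budget
        picked.append((utt_id, dur))
        total += dur
    return picked, total / 3600.0


# Toy corpus: 1000 clips of 10 seconds each (made-up values).
utterances = [(f"utt{i}", 10.0) for i in range(1000)]
picked, hours = subsample_to_hours(utterances, target_hours=1.0)
```

Fixing the seed keeps the subsample reproducible, which matters if the matched-volume comparison is to be rerun by others.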

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and indicate where revisions will be made to strengthen the presentation of our empirical claims.

  1. Referee: [Abstract] The reported superiority of MLAAD over InTheWild and FakeOrReal as a training resource does not indicate whether training-set sizes (in hours or utterances) were matched across conditions. MLAAD's 1002.9 h is an order of magnitude larger than typical comparables; without subsampling or explicit volume controls, performance differences cannot be attributed to the 175 TTS models or 54-language coverage rather than data quantity.

    Authors: We agree that the abstract does not indicate whether training-set sizes were matched. Our experiments used the full MLAAD (1002.9 hours) against the naturally smaller InTheWild and FakeOrReal corpora. While dataset volume is a contributing factor, the diversity across 175 TTS systems and 54 languages is intended to be the primary contribution. In revision we will explicitly report the sizes employed in each comparison and add a controlled experiment using a volume-matched subsample of MLAAD. revision: yes

  2. Referee: [Abstract (evaluation paragraph)] The central empirical claims lack any description of training hyperparameters, exact test-set construction protocols, or statistical significance testing, leaving the performance gains and the four/four alternation result only partially supported by the reported evidence.

    Authors: The abstract is space-constrained, yet we accept that the evaluation paragraph should reference these elements. The full manuscript contains the training hyperparameters, test-set construction details, and evaluation protocol in the experimental section. We will revise the abstract to include a concise reference to these aspects and ensure statistical significance testing (e.g., via bootstrap resampling) is explicitly described. revision: yes

  3. Referee: [Abstract] The complementarity claim with ASVspoof 2019 across eight datasets requires explicit enumeration of those eight test collections, the precise evaluation metrics, and confirmation that identical model architectures and training schedules were used in all head-to-head comparisons.

    Authors: We agree that the abstract should enumerate the eight test collections, state the metrics (equal error rate), and confirm identical architectures and schedules. In the revision we will add this enumeration, either in the abstract or via a new table, and explicitly note that the same three detection models and training procedures were applied in all head-to-head comparisons. revision: yes
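
The responses above name equal error rate as the metric and bootstrap resampling as one option for significance testing. The sketch below shows how the two could fit together when two detectors are scored on the same test set. It is an illustration under assumptions, not the authors' procedure: the function names, the score convention (higher means more spoof-like, label 1 means spoof), the threshold sweep used to approximate the EER, and the toy scores are all hypothetical.

```python
import random


def equal_error_rate(scores, labels):
    """Approximate EER by sweeping thresholds over the observed scores.

    Assumed convention: higher score = more spoof-like; label 1 = spoof, 0 = bona fide.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need both spoof and bona fide samples")
    best_gap, eer = None, None
    for t in sorted(set(scores)):
        fnr = sum(s < t for s in pos) / len(pos)   # spoofs scored below threshold
        fpr = sum(s >= t for s in neg) / len(neg)  # bona fide at or above threshold
        gap = abs(fnr - fpr)
        if best_gap is None or gap < best_gap:
            best_gap, eer = gap, (fnr + fpr) / 2
    return eer


def bootstrap_eer_difference(scores_a, scores_b, labels, n_resamples=1000, seed=0):
    """95% percentile interval for EER(A) - EER(B), resampling test items in pairs."""
    if len(set(labels)) < 2:
        raise ValueError("need both spoof and bona fide samples")
    rng = random.Random(seed)
    n = len(labels)
    diffs = []
    while len(diffs) < n_resamples:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if len(set(ys)) < 2:
            continue  # resample lacked one class; draw again
        a = equal_error_rate([scores_a[i] for i in idx], ys)
        b = equal_error_rate([scores_b[i] for i in idx], ys)
        diffs.append(a - b)
    diffs.sort()
    low = diffs[int(0.025 * (n_resamples - 1))]
    high = diffs[int(0.975 * (n_resamples - 1))]
    return low, high


# Made-up scores for two hypothetical detectors on the same six test clips.
labels = [0, 0, 0, 1, 1, 1]
scores_a = [0.1, 0.2, 0.6, 0.4, 0.8, 0.9]
scores_b = [0.1, 0.3, 0.2, 0.7, 0.8, 0.9]
low, high = bootstrap_eer_difference(scores_a, scores_b, labels, n_resamples=200)
```

If the resulting interval excludes zero, the EER gap between the two detectors is unlikely to be a resampling artifact. Resampling the same indices for both detectors keeps the comparison paired, so per-clip difficulty does not inflate the interval.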

Circularity Check

0 steps flagged

No circularity; empirical evaluation is externally grounded


The paper constructs MLAAD and reports performance by training standard detectors on it then evaluating on eight external test collections (and comparing against models trained on InTheWild, FakeOrReal, ASVspoof 2019). These are direct experimental outcomes with no equations, fitted parameters, or self-definitions that reduce the reported metrics to quantities defined inside the paper. No self-citation chains or ansatzes are invoked to justify the central claims; the results remain falsifiable against the released dataset and external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical dataset release paper. No mathematical derivations, free parameters, axioms, or invented entities are introduced.



Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Evaluating Generalization and Robustness in Russian Anti-Spoofing: The RuASD Initiative

    cs.SD · 2026-03 · accept · novelty 6.0

    RuASD is a comprehensive Russian speech anti-spoofing dataset featuring 37 synthesis systems and a robustness evaluation pipeline for real-world channel distortions.
