MLAAD: The Multi-Language Audio Anti-Spoofing Dataset
Pith reviewed 2026-05-24 04:32 UTC · model grok-4.3
The pith
MLAAD, built from 175 TTS systems across 54 languages, trains audio deepfake detectors that outperform those trained on prior datasets and complement ASVspoof 2019.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
MLAAD supplies 1002.9 hours of synthetic audio generated by 175 TTS models in 54 languages. Training three state-of-the-art detectors on MLAAD produces better performance than training on InTheWild or FakeOrReal. In head-to-head tests across eight evaluation sets, MLAAD and ASVspoof 2019 each outperform the other on exactly four datasets. The authors release both the full dataset and a trained model through a public webserver.
What carries the argument
The MLAAD dataset itself, assembled from the outputs of 175 separate TTS models.
If this is right
- Detectors trained on MLAAD will identify a broader set of synthetic voices than detectors trained on narrower existing collections.
- Pairing MLAAD with ASVspoof 2019 supplies coverage that neither resource achieves alone.
- The 54-language scope supports training detectors that remain effective outside English-dominant test conditions.
- Public release of the dataset and webserver model allows non-specialists to build and test anti-spoofing tools directly.
Where Pith is reading between the lines
- Future TTS systems that differ substantially from the 175 models used here could still evade detectors trained on MLAAD.
- The same multi-source generation strategy might be applied to other modalities such as video to create similarly diverse training sets.
- Performance differences across the 54 languages could guide targeted collection of additional data for under-performing languages.
Load-bearing premise
The synthetic voices produced by these 175 TTS models capture the range of features found in real-world deepfake audio.
What would settle it
A detection model trained only on MLAAD shows markedly lower accuracy on deepfakes generated by a new TTS system that was never used to create any of the dataset's samples.
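The test described above can be set up as a split by generating system, so that the held-out TTS system contributes nothing to training. The following is a minimal sketch, not the authors' protocol: the manifest layout (`(path, tts_system)` pairs), the file names, and the system names are hypothetical placeholders.

```python
def split_by_tts_system(samples, held_out_systems):
    """Partition (path, tts_system) pairs so held-out systems never reach training."""
    held_out = set(held_out_systems)
    train, test = [], []
    for path, system in samples:
        # Route every clip from a held-out system to the test side only.
        (test if system in held_out else train).append((path, system))
    return train, test


# Hypothetical manifest; real entries would come from the dataset's metadata.
samples = [
    ("a.wav", "system_A"),
    ("b.wav", "system_B"),
    ("c.wav", "system_A"),
    ("d.wav", "system_C"),
]
train, test = split_by_tts_system(samples, ["system_C"])
```

A real evaluation would also need bona fide recordings on both sides of the split, which this sketch leaves out.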
Original abstract
This paper presents the Multi-Language Audio Anti-Spoofing Dataset (MLAAD), version 10: a dataset of synthetic audio to train and evaluate audio deepfake detection models. It features 175 Text-to-Speech (TTS) models, comprising a total of 1002.9 hours of synthetic voice in 54 different languages. To evaluate this dataset, we train three state-of-the-art deepfake detection models with MLAAD and observe that it demonstrates superior performance to comparable datasets like InTheWild and FakeOrReal when used as a training resource. Moreover, compared to the renowned ASVspoof 2019 dataset, MLAAD proves to be a complementary resource. In tests across eight datasets, MLAAD and ASVspoof 2019 alternately outperformed each other, each excelling on four datasets. By publishing the dataset and making a trained model accessible via an interactive webserver, we aim to democratize anti-spoofing technology, making it accessible beyond the realm of specialists, and contributing to global efforts against audio spoofing and deepfakes.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents the Multi-Language Audio Anti-Spoofing Dataset (MLAAD) version 10, comprising 1002.9 hours of synthetic audio from 175 TTS models across 54 languages. It claims that models trained on MLAAD achieve superior performance to those trained on InTheWild and FakeOrReal, and that MLAAD is complementary to ASVspoof 2019, with each resource outperforming the other on four of eight external test datasets.
Significance. Release of a large-scale multi-lingual synthetic audio corpus with 175 distinct TTS systems could provide a useful training resource for audio deepfake detection, particularly if the complementarity result with ASVspoof 2019 holds under controlled conditions. The accompanying webserver model further supports accessibility. The empirical superiority claims, however, rest on comparisons whose validity depends on addressing data-volume confounds and providing missing experimental details.
major comments (3)
- [Abstract] The reported superiority of MLAAD over InTheWild and FakeOrReal as a training resource does not indicate whether training-set sizes (in hours or utterances) were matched across conditions. At 1002.9 h, MLAAD is an order of magnitude larger than typical comparable corpora; without subsampling or explicit volume controls, performance differences cannot be attributed to the 175 TTS models or 54-language coverage rather than data quantity.
- [Abstract, evaluation paragraph] The central empirical claims lack any description of training hyperparameters, exact test-set construction protocols, or statistical significance testing, leaving the performance gains and the four/four alternation result only partially supported by the reported evidence.
- [Abstract] The complementarity claim with ASVspoof 2019 across eight datasets requires explicit enumeration of those eight test collections, the precise evaluation metrics, and confirmation that identical model architectures and training schedules were used in all head-to-head comparisons.
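The volume control requested in the first comment could look roughly like the sketch below: draw a random subset of the larger corpus until it reaches the hour budget of the smaller one. This is an illustration under stated assumptions, not a procedure from the manuscript; the `(utterance_id, duration_seconds)` layout, the function name, and the example durations are hypothetical.

```python
import random


def subsample_to_hours(utterances, target_hours, seed=0):
    """Randomly pick (utterance_id, duration_seconds) pairs up to an hour budget."""
    rng = random.Random(seed)
    shuffled = list(utterances)
    rng.shuffle(shuffled)
    budget = target_hours * 3600.0
    chosen, total = [], 0.0
    for utt_id, duration in shuffled:
        if total + duration > budget:
            continue  # skip clips that would overshoot the budget
        chosen.append((utt_id, duration))
        total += duration
    return chosen, total / 3600.0


# Hypothetical corpus: 1000 clips of 10 seconds each.
corpus = [(f"utt{i}", 10.0) for i in range(1000)]
subset, hours = subsample_to_hours(corpus, target_hours=1.0)
```

Plain random sampling does not preserve the share of each TTS system or language; a stratified variant would be needed to keep the diversity under study intact.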
minor comments (1)
- The manuscript should clarify the rationale for version numbering (v10) and any changes relative to earlier releases of the same corpus.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each major comment below and indicate where revisions will be made to strengthen the presentation of our empirical claims.
- Referee: [Abstract] The reported superiority of MLAAD over InTheWild and FakeOrReal as a training resource does not indicate whether training-set sizes (in hours or utterances) were matched across conditions. MLAAD's 1002.9 h is an order of magnitude larger than typical comparables; without subsampling or explicit volume controls, performance differences cannot be attributed to the 175 TTS models or 54-language coverage rather than data quantity.
  Authors: We agree that the abstract does not indicate whether training-set sizes were matched. Our experiments used the full MLAAD (1002.9 hours) against the naturally smaller InTheWild and FakeOrReal corpora. While dataset volume is a contributing factor, the diversity across 175 TTS systems and 54 languages is intended to be the primary contribution. In revision we will explicitly report the sizes employed in each comparison and add a controlled experiment using a volume-matched subsample of MLAAD.
  Revision: yes
- Referee: [Abstract, evaluation paragraph] The central empirical claims lack any description of training hyperparameters, exact test-set construction protocols, or statistical significance testing, leaving the performance gains and the four/four alternation result only partially supported by the reported evidence.
  Authors: The abstract is space-constrained, yet we accept that the evaluation paragraph should reference these elements. The full manuscript contains the training hyperparameters, test-set construction details, and evaluation protocol in the experimental section. We will revise the abstract to include a concise reference to these aspects and ensure statistical significance testing (e.g., via bootstrap resampling) is explicitly described.
  Revision: yes
- Referee: [Abstract] The complementarity claim with ASVspoof 2019 across eight datasets requires explicit enumeration of those eight test collections, the precise evaluation metrics, and confirmation that identical model architectures and training schedules were used in all head-to-head comparisons.
  Authors: We agree that the abstract should enumerate the eight test collections, state the metrics (equal error rate), and confirm identical architectures and schedules. In the revision we will add this enumeration, either in the abstract or via a new table, and explicitly note that the same three detection models and training procedures were applied in all head-to-head comparisons.
  Revision: yes
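The responses name equal error rate as the metric and bootstrap resampling as a possible significance check. A minimal sketch of both follows, assuming labels of 1 for spoof and 0 for bona fide, and scores where higher means more spoof-like. The function names, the threshold sweep, and the toy scores are illustrative choices, not the authors' implementation.

```python
import random


def equal_error_rate(scores, labels):
    """Approximate EER: the operating point where miss and false-alarm rates meet."""
    pos = sum(labels)
    neg = len(labels) - pos
    if pos == 0 or neg == 0:
        raise ValueError("need both spoof and bona fide samples")
    best_gap, eer = float("inf"), 1.0
    # Sweep every observed score as a threshold, plus one above the maximum.
    for t in sorted(set(scores)) + [float("inf")]:
        miss = sum(1 for s, y in zip(scores, labels) if y == 1 and s < t) / pos
        false_alarm = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= t) / neg
        gap = abs(miss - false_alarm)
        if gap < best_gap:
            best_gap, eer = gap, (miss + false_alarm) / 2
    return eer


def bootstrap_eer_ci(scores, labels, n_resamples=200, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the EER (resamples trials with replacement)."""
    rng = random.Random(seed)
    n = len(scores)
    estimates = []
    while len(estimates) < n_resamples:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if 0 < sum(ys) < n:  # keep only resamples that contain both classes
            estimates.append(equal_error_rate([scores[i] for i in idx], ys))
    estimates.sort()
    lo_idx = max(0, int((alpha / 2) * n_resamples))
    hi_idx = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples) - 1)
    return estimates[lo_idx], estimates[hi_idx]


# Toy scores for illustration only; these are not results from the paper.
scores = [0.9, 0.4, 0.6, 0.1]
labels = [1, 1, 0, 0]
eer = equal_error_rate(scores, labels)
low, high = bootstrap_eer_ci(scores, labels)
```

The sweep is quadratic in the number of trials, which is fine for a sketch but too slow for large test sets; a sorted cumulative count would be the usual replacement. Comparing two training sets on the same test set would resample the same trial indices for both models and look at the difference in EER.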
Circularity Check
No circularity; empirical evaluation is externally grounded
The paper constructs MLAAD and reports performance by training standard detectors on it then evaluating on eight external test collections (and comparing against models trained on InTheWild, FakeOrReal, ASVspoof 2019). These are direct experimental outcomes with no equations, fitted parameters, or self-definitions that reduce the reported metrics to quantities defined inside the paper. No self-citation chains or ansatzes are invoked to justify the central claims; the results remain falsifiable against the released dataset and external benchmarks.
Forward citations
Cited by 1 Pith paper
- Evaluating Generalization and Robustness in Russian Anti-Spoofing: The RuASD Initiative
  RuASD is a comprehensive Russian speech anti-spoofing dataset featuring 37 synthesis systems and a robustness evaluation pipeline for real-world channel distortions.
APPENDIX
Architecture and duration:
- Azzurra-Voice: 1.5
- Bark: 123.3
- Capacitron: 1.5
- Chatterbox: 14.5
- E2 TTS: 1.8
- F5 TTS: 1.7
- FastPitch: 2.1
- FireRedTTS: 5.5
- FishTTS: 7.4
- GlowTTS: 12.6
- Griffin Li...