MLAAD: The Multi-Language Audio Anti-Spoofing Dataset

Edresson Casanova; Eren Gölge; Konstantin Böttinger; Nicolas M. Müller; Philip Sperl; Piotr Kawa; Piotr Syga; Thorsten Müller; Wei Herng Choong

MLAAD, built from 175 TTS systems across 54 languages, trains audio deepfake detectors that outperform those trained on prior datasets and complement ASVspoof 2019.

Reviewed by Pith at T0; open to challenge. T0 means a machine referee read the full paper against a public rubric.

T0 review · grok-4.3

2026-05-24 04:32 UTC pith:W3ZDYWZB

Load-bearing objection: MLAAD releases a genuinely large multi-lingual anti-spoofing corpus, but the superiority claims over smaller datasets are likely driven by raw training volume rather than the claimed diversity.

arxiv 2401.09512 v10 pith:W3ZDYWZB submitted 2024-01-17 cs.SD eess.AS

keywords: audio deepfake detection, anti-spoofing dataset, text-to-speech synthesis, multi-language audio, synthetic speech, deepfake detection models, ASVspoof comparison

verification ladder: T0 review, T1 audit, T2 compute, T3 formal, T4 reserved

The pith

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces MLAAD as a large collection of synthetic audio for training and evaluating deepfake detectors. It draws on 175 distinct text-to-speech models to produce 1002.9 hours of voice data spanning 54 languages. Detectors trained on this resource show stronger results than those trained on smaller or less varied sets such as InTheWild and FakeOrReal. Across eight test datasets, MLAAD and the established ASVspoof 2019 dataset each lead on four of them.

Core claim

MLAAD supplies 1002.9 hours of synthetic audio generated by 175 TTS models in 54 languages. Training three state-of-the-art detectors on MLAAD produces better performance than training on InTheWild or FakeOrReal. In head-to-head tests across eight evaluation sets, MLAAD and ASVspoof 2019 each outperform the other on exactly four datasets. The authors publish the dataset and make a trained model accessible through an interactive webserver.

What carries the argument

The MLAAD dataset itself, assembled from the outputs of 175 separate TTS models.

Load-bearing premise

The synthetic voices produced by these 175 TTS models capture the range of features found in real-world deepfake audio.

What would settle it

A detection model trained only on MLAAD shows markedly lower accuracy on deepfakes generated by a new TTS system that was never used to create any of the dataset's samples.
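The falsifier above amounts to a held-out-system evaluation. A minimal sketch of such a split follows, assuming each sample is a hypothetical `(path, tts_system)` pair; the file names and system labels are illustrative and are not taken from MLAAD's actual metadata.

```python
def split_by_held_out_system(samples, held_out_system):
    """Split (path, tts_system) pairs so one TTS system never appears in training.

    The pair layout and the system labels are assumptions made for illustration.
    """
    train, test = [], []
    for path, system in samples:
        if system == held_out_system:
            test.append((path, system))   # unseen generator: evaluation only
        else:
            train.append((path, system))  # every other generator: training
    return train, test


samples = [
    ("a.wav", "sys_A"),
    ("b.wav", "sys_B"),
    ("c.wav", "sys_A"),
    ("d.wav", "sys_C"),
]
train, test = split_by_held_out_system(samples, "sys_A")
```

Repeating the split once per system would give a leave-one-system-out estimate of how far a detector degrades on generators it has never seen.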

If this is right

Detectors trained on MLAAD will identify a broader set of synthetic voices than detectors trained on narrower existing collections.
Pairing MLAAD with ASVspoof 2019 supplies coverage that neither resource achieves alone.
The 54-language scope supports training detectors that remain effective outside English-dominant test conditions.
Public release of the dataset and webserver model allows non-specialists to build and test anti-spoofing tools directly.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Future TTS systems that differ substantially from the 175 models used here could still evade detectors trained on MLAAD.
The same multi-source generation strategy might be applied to other modalities such as video to create similarly diverse training sets.
Performance differences across the 54 languages could guide targeted collection of additional data for under-performing languages.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit.

Desk Editor's Note

MLAAD releases a genuinely large multi-lingual anti-spoofing corpus, but the superiority claims over smaller datasets are likely driven by raw training volume rather than the claimed diversity.

The main contribution here is the dataset itself: 1002.9 hours of synthetic speech from 175 TTS systems across 54 languages. That scale and coverage are new and fill a clear gap for anyone trying to move beyond English-centric training data. Releasing both the data and a web demo is the right move for a resource paper and should get some uptake in the anti-spoofing community. The complementarity result with ASVspoof 2019 across eight test sets is also worth noting; it suggests the two resources are not redundant. The paper does not overclaim novelty in the TTS methods themselves, which keeps the focus honest.

The soft spot is the comparison to InTheWild and FakeOrReal. Those sets are much smaller, and the abstract gives no sign that the authors equalized total training hours or utterance counts before claiming MLAAD is superior. Without that control, the performance gap is confounded by quantity. A related gap is the absence of reported training details, test-set construction, and significance tests. Readers will need the full experimental protocol to judge whether the multi-lingual aspect or the sheer volume is doing the work.

This is a standard dataset release with empirical benchmarking. It is useful for groups already working on audio deepfake detection who need more non-English material and are willing to run their own controls. It is not a methods paper, so the bar is whether the release is clean and the scale is real. The work is coherent on its own terms and shows clear engagement with the existing benchmarks. I would send it to peer review with a request for data-volume-matched ablations and more methodological detail; the dataset itself is worth referee time even if the current claims need tightening.

Referee Report

3 major / 1 minor

Summary. The manuscript presents the Multi-Language Audio Anti-Spoofing Dataset (MLAAD) version 10, comprising 1002.9 hours of synthetic audio from 175 TTS models across 54 languages. It claims that models trained on MLAAD achieve superior performance to those trained on InTheWild and FakeOrReal, and that MLAAD is complementary to ASVspoof 2019, with each resource outperforming the other on four of eight external test datasets.

Significance. Release of a large-scale multi-lingual synthetic audio corpus with 175 distinct TTS systems could provide a useful training resource for audio deepfake detection, particularly if the complementarity result with ASVspoof 2019 holds under controlled conditions. The accompanying webserver model further supports accessibility. The empirical superiority claims, however, rest on comparisons whose validity depends on addressing data-volume confounds and providing missing experimental details.

major comments (3)

[Abstract] The reported superiority of MLAAD over InTheWild and FakeOrReal as a training resource does not indicate whether training-set sizes (in hours or utterances) were matched across conditions. MLAAD's 1002.9 h is an order of magnitude larger than typical comparables; without subsampling or explicit volume controls, performance differences cannot be attributed to the 175 TTS models or 54-language coverage rather than data quantity.
[Abstract (evaluation paragraph)] The central empirical claims lack any description of training hyperparameters, exact test-set construction protocols, or statistical significance testing, leaving the performance gains and the four/four alternation result only partially supported by the reported evidence.
[Abstract] The complementarity claim with ASVspoof 2019 across eight datasets requires explicit enumeration of those eight test collections, the precise evaluation metrics, and confirmation that identical model architectures and training schedules were used in all head-to-head comparisons.
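The volume control requested in the first major comment could be sketched as below. This is an illustrative assumption, not the authors' protocol: utterances are hypothetical `(utt_id, duration_seconds)` pairs, and the helper `subsample_to_hours` draws a random subset whose total duration stays within a target number of hours.

```python
import random


def subsample_to_hours(utterances, target_hours, seed=0):
    """Randomly select utterances whose total duration does not exceed target_hours.

    utterances: list of (utt_id, duration_seconds) pairs (hypothetical layout).
    Returns the chosen pairs and their total duration in hours.
    """
    rng = random.Random(seed)
    shuffled = list(utterances)
    rng.shuffle(shuffled)

    budget_seconds = target_hours * 3600.0
    chosen, total_seconds = [], 0.0
    for utt_id, duration in shuffled:
        if total_seconds + duration > budget_seconds:
            continue  # skip clips that would overshoot the budget
        chosen.append((utt_id, duration))
        total_seconds += duration
    return chosen, total_seconds / 3600.0


# Toy corpus: 1000 clips of 10 seconds each, cut down to one hour.
utterances = [(f"u{i}", 10.0) for i in range(1000)]
chosen, hours = subsample_to_hours(utterances, target_hours=1.0)
```

Uniform sampling matches volume only. Whether to also stratify by TTS system or language is a separate design choice, since stratifying preserves the very diversity whose effect the ablation is meant to isolate.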

minor comments (1)

The manuscript should clarify the rationale for version numbering (v10) and any changes relative to earlier releases of the same corpus.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and indicate where revisions will be made to strengthen the presentation of our empirical claims.

Referee: [Abstract] The reported superiority of MLAAD over InTheWild and FakeOrReal as a training resource does not indicate whether training-set sizes (in hours or utterances) were matched across conditions. MLAAD's 1002.9 h is an order of magnitude larger than typical comparables; without subsampling or explicit volume controls, performance differences cannot be attributed to the 175 TTS models or 54-language coverage rather than data quantity.

Authors: We agree that the abstract does not indicate whether training-set sizes were matched. Our experiments used the full MLAAD (1002.9 hours) against the naturally smaller InTheWild and FakeOrReal corpora. While dataset volume is a contributing factor, the diversity across 175 TTS systems and 54 languages is intended to be the primary contribution. In revision we will explicitly report the sizes employed in each comparison and add a controlled experiment using a volume-matched subsample of MLAAD. revision: yes

Referee: [Abstract (evaluation paragraph)] The central empirical claims lack any description of training hyperparameters, exact test-set construction protocols, or statistical significance testing, leaving the performance gains and the four/four alternation result only partially supported by the reported evidence.

Authors: The abstract is space-constrained, yet we accept that the evaluation paragraph should reference these elements. The full manuscript contains the training hyperparameters, test-set construction details, and evaluation protocol in the experimental section. We will revise the abstract to include a concise reference to these aspects and ensure statistical significance testing (e.g., via bootstrap resampling) is explicitly described. revision: yes

Referee: [Abstract] The complementarity claim with ASVspoof 2019 across eight datasets requires explicit enumeration of those eight test collections, the precise evaluation metrics, and confirmation that identical model architectures and training schedules were used in all head-to-head comparisons.

Authors: We agree that the abstract should enumerate the eight test collections, state the metrics (equal error rate), and confirm identical architectures and schedules. In the revision we will add this enumeration, either in the abstract or via a new table, and explicitly note that the same three detection models and training procedures were applied in all head-to-head comparisons. revision: yes
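The responses above name equal error rate as the metric and bootstrap resampling as a possible significance check. The sketch below shows one common way to compute both in plain Python. It assumes higher scores mean "more likely spoof" and labels are 1 for spoof and 0 for bona fide; the function names and the percentile interval are illustrative choices, not the procedure used in the paper.

```python
import random


def equal_error_rate(scores, labels):
    """Approximate EER: the operating point where false-alarm and miss rates are closest.

    Assumes higher score = more likely spoof; labels: 1 = spoof, 0 = bona fide.
    """
    n_spoof = sum(labels)
    n_bona = len(labels) - n_spoof
    if n_spoof == 0 or n_bona == 0:
        raise ValueError("both classes are required to compute an EER")

    pairs = sorted(zip(scores, labels))
    # Threshold below every score: all clips flagged as spoof.
    false_alarms, misses = n_bona, 0
    best_gap, eer = 1.0, 0.5
    i = 0
    while i < len(pairs):
        j = i
        # Move the threshold just above the current score, handling ties together.
        while j < len(pairs) and pairs[j][0] == pairs[i][0]:
            if pairs[j][1] == 1:
                misses += 1
            else:
                false_alarms -= 1
            j += 1
        far = false_alarms / n_bona
        frr = misses / n_spoof
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), 0.5 * (far + frr)
        i = j
    return eer


def bootstrap_eer_interval(scores, labels, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the EER (resampling clips with replacement)."""
    rng = random.Random(seed)
    n = len(scores)
    estimates = []
    while len(estimates) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        resampled_labels = [labels[k] for k in idx]
        if 0 < sum(resampled_labels) < n:  # skip resamples that lost a class
            estimates.append(equal_error_rate([scores[k] for k in idx], resampled_labels))
    estimates.sort()
    lo_idx = max(0, round(alpha / 2 * n_boot))
    hi_idx = min(n_boot - 1, round((1 - alpha / 2) * n_boot) - 1)
    return estimates[lo_idx], estimates[hi_idx]


scores = [0.1, 0.2, 0.3, 0.7, 0.8, 0.9]
labels = [0, 0, 0, 1, 1, 1]
eer = equal_error_rate(scores, labels)
low, high = bootstrap_eer_interval(scores, labels, n_boot=200)
```

For comparing two training sets on the same test collection, resampling the test clips once and scoring both detectors on each resample (a paired bootstrap) is usually the more informative variant, because it accounts for shared test-set variance.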

Circularity Check

0 steps flagged

No circularity; empirical evaluation is externally grounded

The paper constructs MLAAD and reports performance by training standard detectors on it then evaluating on eight external test collections (and comparing against models trained on InTheWild, FakeOrReal, ASVspoof 2019). These are direct experimental outcomes with no equations, fitted parameters, or self-definitions that reduce the reported metrics to quantities defined inside the paper. No self-citation chains or ansatzes are invoked to justify the central claims; the results remain falsifiable against the released dataset and external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical dataset release paper. No mathematical derivations, free parameters, axioms, or invented entities are introduced.

pith-pipeline@v0.9.0 · 5757 in / 1141 out tokens · 23411 ms · 2026-05-24T04:32:38.717856+00:00 · methodology

Original abstract

This paper presents the Multi-Language Audio Anti-Spoofing Dataset (MLAAD), version 10: a dataset of synthetic audio to train and evaluate audio deepfake detection models. It features 175 Text-to-Speech (TTS) models, comprising a total of 1002.9 hours of synthetic voice in 54 different languages. To evaluate this dataset, we train three state-of-the-art deepfake detection models with MLAAD and observe that it demonstrates superior performance to comparable datasets like InTheWild and FakeOrReal when used as a training resource. Moreover, compared to the renowned ASVspoof 2019 dataset, MLAAD proves to be a complementary resource. In tests across eight datasets, MLAAD and ASVspoof 2019 alternately outperformed each other, each excelling on four datasets. By publishing the dataset and making a trained model accessible via an interactive webserver, we aim to democratize anti-spoofing technology, making it accessible beyond the realm of specialists, and contributing to global efforts against audio spoofing and deepfakes.

Figures

Figures reproduced from arXiv: 2401.09512 by Edresson Casanova, Eren Gölge, Konstantin Böttinger, Nicolas M. Müller, Philip Sperl, Piotr Kawa, Piotr Syga, Thorsten Müller, Wei Herng Choong.

**Figure 2.** Composition of MLAAD with respect to TTS model architecture (top), training dataset (middle) and language distribution

Forward citations

Cited by 5 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

When EER Hides Deployment Failure: Auditing Threshold Transfer and Unlabeled Score Calibration for Speech Deepfake Detectors
cs.SD 2026-06 accept novelty 7.0

Speech deepfake detectors achieve low EER on labeled test sets but suffer high half total error rates at transferred thresholds on unlabeled data, and common score calibrations leave EER essentially unchanged.

Ethical and Technical Limits of Deepfake Speech Datasets
cs.SD 2026-06 unverdicted novelty 6.0

Audit of 39 deepfake speech datasets shows most lack demographic metadata making fairness checks infeasible and reveals substantial overlap in bona fide sources that undermines cross-dataset generalization claims.

Evaluating Generalization and Robustness in Russian Anti-Spoofing: The RuASD Initiative
cs.SD 2026-03 accept novelty 6.0

RuASD is a comprehensive Russian speech anti-spoofing dataset featuring 37 synthesis systems and a robustness evaluation pipeline for real-world channel distortions.

The Hidden Cost of Pairwise Verification in Synthetic Speech Source Tracing
cs.SD 2026-06 conditional novelty 5.0

Global anchoring outperforms pairwise verification in synthetic speech source tracing by preserving more discriminative embedding directions, yielding lower error rates on in-domain and out-of-domain data.

Anchoring the Unknown: Open-Set Model Attribution via Proxy-Anchor Learning
eess.AS 2026-06 unverdicted novelty 5.0

Proxy-Anchor metric learning on Wav2Vec2-BERT embeddings with architecture merging achieves 99.76% closed-set accuracy and 2.04% FPR@95 OOD detection on MLAAD v9, doubling prior OOD accuracy on v5 splits.

Reference graph

Works this paper leans on

71 extracted references · 71 canonical work pages · cited by 5 Pith papers · 1 internal anchor

[1]

Par- rotron: An End-to-End Speech-to-Speech Conversion Model and its Applications to Hearing-Impaired Speech and Speech Separation,

F. Biadsy, R. J. Weiss, P. J. Moreno, D. Kanvesky, and Y . Jia, “Par- rotron: An End-to-End Speech-to-Speech Conversion Model and its Applications to Hearing-Impaired Speech and Speech Separation,” in Proc. Interspeech 2019 , 2019, pp. 4115–4119

work page 2019
[2]

Apple introduces new features for cognitive accessibility, along with live speech, personal voice, and point and speak in magnifier,

Apple, “Apple introduces new features for cognitive accessibility, along with live speech, personal voice, and point and speak in magnifier,” https://www.apple.com/newsroom/2023/05/ apple-previews-live-speech-personal, 2023, accessed on 01/02/2024

work page 2023
[3]

Fraudsters cloned company director’s voice in $35 million bank heist, police find,

Forbes, “Fraudsters cloned company director’s voice in $35 million bank heist, police find,” https://www.forbes.com/sites/thomasbrewster/2021/ 10/14/huge-bank-fraud-uses-deep-fake-voice-tech-to-steal-millions, 2023, accessed on 01/02/2024

work page 2021
[4]

Asvspoof 2019: A large-scale public database of synthesized, converted and replayed speech,

X. Wang, J. Yamagishi, M. Todisco, H. Delgado, A. Nautsch, N. Evans, M. Sahidullah, V . Vestman, T. Kinnunen, K. A. Lee et al. , “Asvspoof 2019: A large-scale public database of synthesized, converted and replayed speech,” Computer Speech & Language , vol. 64, p. 101114, 2020

work page 2019
[5]

Content authenticity initiative,

“Content authenticity initiative,” https://contentauthenticity.org/, (Ac- cessed on 01/02/2024)

work page 2024
[6]

Does Audio Deepfake Detection Generalize?

N. M ¨uller, P. Czempin, F. Diekmann, A. Froghyar, and K. B ¨ottinger, “Does Audio Deepfake Detection Generalize?” in Proc. Interspeech 2022, 2022, pp. 2783–2787

work page 2022
[7]

”ai putin

T. Sun, “”ai putin” gave his chilling new year message, viewers convinced after ‘telltale sign’ spotted as death rumours swirl — the sun,” https://www.thesun.co.uk/news/25220488/ ai-putin-chilling-message-death-rumours/, (Accessed on 01/02/2024)

work page arXiv 2024
[8]

Deepfake elections: How indian politicians are using ai-manipulated media to malign,

O. India, “Deepfake elections: How indian politicians are using ai-manipulated media to malign,” https://business.outlookindia.com/ technology/deepfake-elections-how-indian-politicians-are-using-ai, (Accessed on 01/02/2024)

work page 2024
[9]

Intelligence artificielle: quand un deepfake d’emmanuel macron emballe la toile,

L. Croix, “Intelligence artificielle: quand un deepfake d’emmanuel macron emballe la toile,” https://www.la-croix.com/France/ Intelligence-artificielle-quand-deepfake-dEmmanuel-Macron, (Accessed on 01/02/2024)

work page 2024
[10]

The m-ailabs speech dataset,

T. M.-A. S. Dataset, “The m-ailabs speech dataset,” https://www. caito.de/2019/01/03/the-m-ailabs-speech-dataset/, 2023, accessed on 01/02/2024

work page 2019
[11]

Asvspoof 2015: the first automatic speaker verification spoofing and countermeasures challenge,

Z. Wu, T. Kinnunen, N. Evans, J. Yamagishi, C. Hanilc ¸i, M. Sahidullah, and A. Sizov, “Asvspoof 2015: the first automatic speaker verification spoofing and countermeasures challenge,” in Sixteenth annual confer- ence of the international speech communication association , 2015

work page 2015
[12]

ASVspoof 2021: accelerating progress in spoofed and deepfake speech detection,

J. Yamagishi, X. Wang, M. Todisco, M. Sahidullah, J. Patino, A. Nautsch, X. Liu, K. A. Lee, T. Kinnunen, N. Evans et al., “Asvspoof 2021: accelerating progress in spoofed and deepfake speech detection,” arXiv preprint arXiv:2109.00537 , 2021

work page arXiv 2021
[13]

FakeA VCeleb: A novel audio-video multimodal deepfake dataset,

H. Khalid, S. Tariq, M. Kim, and S. S. Woo, “FakeA VCeleb: A novel audio-video multimodal deepfake dataset,” in Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2) , 2021. [Online]. Available: https://openreview.net/forum?id=TAXFsg6ZaOl

work page 2021
[14]

For: A dataset for synthetic speech detection,

R. Reimao and V . Tzerpos, “For: A dataset for synthetic speech detection,” in 2019 International Conference on Speech Technology and Human-Computer Dialogue (SpeD) . IEEE, 2019, pp. 1–10

work page 2019
[15]

Spoofed training data for speech spoofing countermeasure can be efficiently created using neural vocoders,

X. Wang and J. Yamagishi, “Spoofed training data for speech spoofing countermeasure can be efficiently created using neural vocoders,” in ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) . IEEE, 2023, pp. 1–5

work page 2023
[16]

The partialspoof database and countermeasures for the detection of short fake speech segments embedded in an utterance,

L. Zhang, X. Wang, E. Cooper, N. Evans, and J. Yamagishi, “The partialspoof database and countermeasures for the detection of short fake speech segments embedded in an utterance,”IEEE/ACM Transactions on Audio, Speech, and Language Processing , vol. 31, pp. 813–825, 2023

work page 2023
[17]

WaveFake: A Data Set to Facilitate Audio Deepfake Detection,

J. Frank and L. Sch ¨onherr, “WaveFake: A Data Set to Facilitate Audio Deepfake Detection,” in Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track , 2021

work page 2021
[18]

Add 2022: the first audio deep synthesis detection challenge,

J. Yi, R. Fu, J. Tao, S. Nie, H. Ma, C. Wang, T. Wang, Z. Tian, Y . Bai, C. Fanet al., “Add 2022: the first audio deep synthesis detection challenge,” in ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) . IEEE, 2022, pp. 9216–9220

work page 2022
[19]

Add 2023: the second audio deepfake detection challenge,

J. Yi, J. Tao, R. Fu, X. Yan, C. Wang, T. Wang, C. Y . Zhang, X. Zhang, Y . Zhao, Y . Renet al., “Add 2023: the second audio deepfake detection challenge,” arXiv preprint arXiv:2305.13774 , 2023

work page arXiv 2023
[20]

Fmfcc-a: a challenging mandarin dataset for synthetic speech detection,

Z. Zhang, Y . Gu, X. Yi, and X. Zhao, “Fmfcc-a: a challenging mandarin dataset for synthetic speech detection,” in International Workshop on Digital Watermarking. Springer, 2021, pp. 117–131

work page 2021
[21]

Half- Truth: A Partially Fake Audio Detection Dataset,

J. Yi, Y . Bai, J. Tao, H. Ma, Z. Tian, C. Wang, T. Wang, and R. Fu, “Half- Truth: A Partially Fake Audio Detection Dataset,” in Proc. Interspeech 2021, 2021, pp. 1654–1658

work page 2021
[22]

CFAD: A Chinese dataset for fake audio detection,

H. Ma, J. Yi, C. Wang, X. Yan, J. Tao, T. Wang, S. Wang, L. Xu, and R. Fu, “Fad: A chinese dataset for fake audio detection,” arXiv preprint arXiv:2207.12308, 2022

work page arXiv 2022
[23]

Natural tts synthesis by conditioning wavenet on mel spectrogram predictions,

J. Shen, R. Pang, R. J. Weiss, M. Schuster, N. Jaitly, Z. Yang, Z. Chen, Y . Zhang, Y . Wang, R. Skerrv-Ryan et al. , “Natural tts synthesis by conditioning wavenet on mel spectrogram predictions,” in 2018 IEEE international conference on acoustics, speech and signal processing (ICASSP). IEEE, 2018, pp. 4779–4783

work page 2018
[24]

SpeedySpeech: Efficient Neural Speech Syn- thesis,

J. Vainer and O. Du ˇsek, “SpeedySpeech: Efficient Neural Speech Syn- thesis,” in Proc. Interspeech 2020 , 2020, pp. 3575–3579

work page 2020
[25]

Better speech synthesis through scaling, 2023

J. Betker, “Better speech synthesis through scaling,” arXiv preprint arXiv:2305.07243, 2023

work page arXiv 2023
[26]

Xtts: Open model release announcement,

Coqui.ai, “Xtts: Open model release announcement,” https://coqui.ai/ blog/tts/open xtts, (Accessed on 01/02/2024)

work page 2024
[27]

Conditional variational autoencoder with adversarial learning for end-to-end text-to-speech,

J. Kim, J. Kong, and J. Son, “Conditional variational autoencoder with adversarial learning for end-to-end text-to-speech,” in International Conference on Machine Learning . PMLR, 2021, pp. 5530–5540

work page 2021
[28]

Fastpitch: Parallel text-to-speech with pitch prediction,

A. Ła ´ncucki, “Fastpitch: Parallel text-to-speech with pitch prediction,” in ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) . IEEE, 2021, pp. 6588–6592

work page 2021
[29]

Effective use of variational embedding capacity in expressive end-to-end speech synthesis,

E. Battenberg, S. Mariooryad, D. Stanton, R. Skerry-Ryan, M. Shannon, D. Kao, and T. Bagby, “Effective use of variational embedding capacity in expressive end-to-end speech synthesis,” 2019

work page 2019
[30]

YourTTS: Towards Zero-Shot Multi-Speaker TTS and Zero-Shot V oice Conversion for everyone,

E. Casanova, J. Weber, C. Shulby, A. C. Junior, E. G ¨olge, and M. Antonelli Ponti, “YourTTS: Towards Zero-Shot Multi-Speaker TTS and Zero-Shot V oice Conversion for everyone,” arXiv e-prints , p. arXiv:2112.02418, Dec. 2021

work page arXiv 2021
[31]

Glow-tts: A generative flow for text-to-speech via monotonic alignment search,

J. Kim, S. Kim, J. Kong, and S. Yoon, “Glow-tts: A generative flow for text-to-speech via monotonic alignment search,” Advances in Neural Information Processing Systems , vol. 33, pp. 8067–8077, 2020

work page 2020
[32]

OverFlow: Putting flows on top of neural transducers for better TTS,

S. Mehta, A. Kirkland, H. Lameris, J. Beskow, ´Eva Sz ´ekely, and G. E. Henter, “OverFlow: Putting flows on top of neural transducers for better TTS,” in Proc. INTERSPEECH 2023 , 2023, pp. 4279–4283

work page 2023
[33]

StarGANv2-VC: A Diverse, Unsupervised, Non-Parallel Framework for Natural-Sounding V oice Conversion,

Y . A. Li, A. Zare, and N. Mesgarani, “StarGANv2-VC: A Diverse, Unsupervised, Non-Parallel Framework for Natural-Sounding V oice Conversion,” in Proc. Interspeech 2021 , 2021, pp. 1349–1353

work page 2021
[34]

Freevc: Towards high-quality text-free one-shot voice conversion,

J. Li, W. Tu, and L. Xiao, “Freevc: Towards high-quality text-free one-shot voice conversion,” in ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) . IEEE, 2023, pp. 1–5

work page 2023
[35]

ESPnet: End-to-end speech processing toolkit,

S. Watanabe, T. Hori, S. Karita, T. Hayashi, J. Nishitoba, Y . Unno, N. Enrique Yalta Soplin, J. Heymann, M. Wiesner, N. Chen, A. Renduchintala, and T. Ochiai, “ESPnet: End-to-end speech processing toolkit,” in Proceedings of Interspeech, 2018, pp. 2207–2211. [Online]. Available: http://dx.doi.org/10.21437/Interspeech.2018-1456

work page doi:10.21437/interspeech.2018-1456 2018
[36]

Coqui TTS,

G. Eren and The Coqui TTS Team, “Coqui TTS,” Jan. 2021. [Online]. Available: https://github.com/coqui-ai/TTS

work page 2021
[37]

SpeechBrain

M. Ravanelli, T. Parcollet, P. Plantinga, A. Rouhe, S. Cornell, L. Lugosch, C. Subakan, N. Dawalatabad, A. Heba, J. Zhong, J.-C. Chou, S.-L. Yeh, S.-W. Fu, C.-F. Liao, E. Rastorgueva, F. Grondin, W. Aris, H. Na, Y . Gao, R. De Mori, and Y . Bengio, “SpeechBrain.” [Online]. Available: https://github.com/speechbrain/speechbrain/

work page
[38]

Transformers: State-of-the-Art Natural Language Processing

T. Wolf, L. Debut, V . Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, C. Ma, Y . Jernite, J. Plu, C. Xu, T. Le Scao, S. Gugger, M. Drame, Q. Lhoest, and A. M. Rush, “Transformers: State-of-the-Art Natural Language Processing.” Association for Computational Linguistics, Oct. 2020, pp. 38–45. [Online]. Available: https://www.aclweb.org/anthology/2020.em...

work page 2020
[39]

The lj speech dataset,

K. Ito and L. Johnson, “The lj speech dataset,” https://keithito.com/ LJ-Speech-Dataset/, 2017

work page 2017
[40]

The blizzard challenge 2013,

S. King and V . Karaiskos, “The blizzard challenge 2013,” 2014. [Online]. Available: https://api.semanticscholar.org/CorpusID:166265879

work page 2013
[41]

CSS10: A Collection of Single Speaker Speech Datasets for 10 Languages,

K. Park and T. Mulc, “CSS10: A Collection of Single Speaker Speech Datasets for 10 Languages,” in Proc. Interspeech 2019, 2019, pp. 1566– 1570

work page 2019
[42]

Common voice: A massively-multilingual speech corpus,

R. Ardila, M. Branson, K. Davis, M. Kohler, J. Meyer, M. Henretty, R. Morais, L. Saunders, F. Tyers, and G. Weber, “Common voice: A massively-multilingual speech corpus,” in Proceedings of The 12th Language Resources and Evaluation Conference . Marseille, France: European Language Resources Association, May 2020, pp. 4218–4222. [Online]. Available: https:...

work page 2020
[43]

Thorsten-voice dataset 2022.10,

T. M ¨uller and D. Kreutz, “Thorsten-voice dataset 2022.10,” October

work page 2022
[44]

Available: https://doi.org/10.5281/zenodo.7265581

[Online]. Available: https://doi.org/10.5281/zenodo.7265581

work page doi:10.5281/zenodo.7265581
[45]

Github - dioco-group/jenny-tts-dataset: A high-quality, varied ˜30hr voice dataset suitable for training a tts model,

“Github - dioco-group/jenny-tts-dataset: A high-quality, varied ˜30hr voice dataset suitable for training a tts model,” https://github.com/ dioco-group/jenny-tts-dataset, (Accessed on 01/02/2024)

work page 2024
[46]

End-to-end anti-spoofing with rawnet2,

H. Tak, J. Patino, M. Todisco, A. Nautsch, N. Evans, and A. Larcher, “End-to-end anti-spoofing with rawnet2,” in ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2021, pp. 6369–6373

work page 2021
[47]

End-to-end spectro-temporal graph attention networks for speaker ver- ification anti-spoofing and speech deepfake detection,

H. Tak, J.-w. Jung, J. Patino, M. Kamble, M. Todisco, and N. Evans, “End-to-end spectro-temporal graph attention networks for speaker ver- ification anti-spoofing and speech deepfake detection,” arXiv preprint arXiv:2107.12710, 2021

work page arXiv 2021
[48]

Aasist: Audio anti-spoofing using integrated spectro-temporal graph attention networks,

J.-w. Jung, H.-S. Heo, H. Tak, H.-j. Shim, J. S. Chung, B.-J. Lee, H.-J. Yu, and N. Evans, “Aasist: Audio anti-spoofing using integrated spectro-temporal graph attention networks,” in ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2022, pp. 6367–6371

work page 2022
[49]

Raw differentiable architecture search for speech deepfake and spoofing detection,

W. Ge, J. Patino, M. Todisco, and N. Evans, “Raw differentiable architecture search for speech deepfake and spoofing detection,” arXiv preprint arXiv:2107.12212, 2021

work page arXiv 2021
[50]

A comparison of features for synthetic speech detection,

M. Sahidullah, T. Kinnunen, and C. Hanilc ¸i, “A comparison of features for synthetic speech detection,” in Proc. Interspeech 2015 , 2015, pp. 2087–2091

work page 2015
[51]

A Comparative Study on Recent Neural Spoofing Countermeasures for Synthetic Speech Detection,

X. Wang and J. Yamagishi, “A Comparative Study on Recent Neural Spoofing Countermeasures for Synthetic Speech Detection,” in Proc. Interspeech 2021, 2021, pp. 4259–4263

work page 2021
[52]

Mesonet: a compact facial video forgery detection network,

D. Afchar, V . Nozick, J. Yamagishi, and I. Echizen, “Mesonet: a compact facial video forgery detection network,” in 2018 IEEE International Workshop on Information Forensics and Security (WIFS), 2018, pp. 1–7

work page 2018
[53]

Specrnet: Towards faster and more ac- cessible audio deepfake detection,

P. Kawa, M. Plata, and P. Syga, “Specrnet: Towards faster and more ac- cessible audio deepfake detection,” in 2022 IEEE International Confer- ence on Trust, Security and Privacy in Computing and Communications (TrustCom), 2022, pp. 792–799

work page 2022
[54]

Fake speech detection using residual network with transformer encoder,

Z. Zhang, X. Yi, and X. Zhao, “Fake speech detection using residual network with transformer encoder,” in Proceedings of the 2021 ACM workshop on information hiding and multimedia security , 2021, pp. 13– 22

work page 2021
[55]

Automatic speaker verification spoofing and deepfake detection using wav2vec 2.0 and data augmentation,

H. Tak, M. Todisco, X. Wang, J.-w. Jung, J. Yamagishi, and N. Evans, “Automatic speaker verification spoofing and deepfake detection using wav2vec 2.0 and data augmentation,” arXiv preprint arXiv:2202.12233, 2022

[56]

The vicomtech audio deepfake detection system based on wav2vec2 for the 2022 add challenge,

J. M. Martín-Doñas and A. Álvarez, “The vicomtech audio deepfake detection system based on wav2vec2 for the 2022 add challenge,” in ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2022, pp. 9241–9245

[57]

Improved DeepFake Detection Using Whisper Features,

P. Kawa, M. Plata, M. Czuba, P. Szymański, and P. Syga, “Improved DeepFake Detection Using Whisper Features,” in Proc. INTERSPEECH 2023, 2023, pp. 4009–4013

[58]

Complex-valued neural networks for voice anti-spoofing,

N. M. Müller, P. Sperl, and K. Böttinger, “Complex-valued neural networks for voice anti-spoofing,” in Proc. INTERSPEECH 2023, 2023, pp. 3814–3818

[59]

Av-deepfake1m: A large-scale llm-driven audio-visual deepfake dataset,

“Av-deepfake1m: A large-scale llm-driven audio-visual deepfake dataset,” arXiv preprint arXiv:2311.15308, 2023

[60]

nmt · pypi,

“nmt · pypi,” https://pypi.org/project/nmt/, (Accessed on 01/15/2024)

[61]

Signal estimation from modified short-time fourier transform,

D. Griffin and J. Lim, “Signal estimation from modified short-time fourier transform,” IEEE Transactions on acoustics, speech, and signal processing, vol. 32, no. 2, pp. 236–243, 1984

[62]

Scaling speech technology to 1,000+ languages,

V. Pratap, A. Tjandra, B. Shi, P. Tomasello, A. Babu, S. Kundu, A. Elkahky, Z. Ni, A. Vyas, M. Fazel-Zarandi et al., “Scaling speech technology to 1,000+ languages,” arXiv preprint arXiv:2305.13516, 2023

[63]

Sam: Accenture non-binary voice,

Accenture, “Sam: Accenture non-binary voice,” https://github.com/Sam-Accenture-Non-Binary-Voice/non-binary-voice-files#licensing, 2023, accessed on 01/02/2024

[64]

Automatic speaker verification spoofing and deepfake detection using wav2vec 2.0 and data augmentation,

H. Tak, M. Todisco, X. Wang, J.-w. Jung, J. Yamagishi, and N. Evans, “Automatic speaker verification spoofing and deepfake detection using wav2vec 2.0 and data augmentation,” in The Speaker and Language Recognition Workshop, 2022

[65]

A study on data augmentation of reverberant speech for robust speech recognition,

T. Ko, V. Peddinti, D. Povey, M. L. Seltzer, and S. Khudanpur, “A study on data augmentation of reverberant speech for robust speech recognition,” in 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2017, pp. 5220–5224

[66]

ESC: Dataset for Environmental Sound Classification,

K. J. Piczak, “ESC: Dataset for Environmental Sound Classification,” in Proceedings of the 23rd Annual ACM Conference on Multimedia. ACM Press, pp. 1015–1018. [Online]. Available: http://dl.acm.org/citation.cfm?doid=2733373.2806390

[67]

Free music archive - instrumental,

F. M. Archive, “Free music archive - instrumental,” https://freemusicarchive.org/genre/Instrumental/, 2023, accessed on 01/02/2024

[68]

MUSAN: A Music, Speech, and Noise Corpus

D. Snyder, G. Chen, and D. Povey, “MUSAN: A Music, Speech, and Noise Corpus,” 2015, arXiv:1510.08484v1

[69]

Speech is silver, silence is golden: What do ASVspoof-trained models really learn?

N. M. Müller, F. Dieckmann, P. Czempin, R. Canals, K. Böttinger, and J. Williams, “Speech is silver, silence is golden: What do asvspoof-trained models really learn?” arXiv preprint arXiv:2106.12914, 2021

[70]

openai/whisper-large · hugging face,

“openai/whisper-large · hugging face,” https://huggingface.co/openai/whisper-large, (Accessed on 01/08/2024)

[71]

Binary codes capable of correcting deletions, insertions, and reversals,

V. I. Levenshtein et al., “Binary codes capable of correcting deletions, insertions, and reversals,” in Soviet physics doklady, vol. 10, no. 8. Soviet Union, 1966, pp. 707–710.

APPENDIX

architecture duration
Azzurra-Voice 1.5
Bark 123.3
Capacitron 1.5
Chatterbox 14.5
E2 TTS 1.8
F5 TTS 1.7
FastPitch 2.1
FireRedTTS 5.5
FishTTS 7.4
GlowTTS 12.6
Griffin Li...

[1]

Parrotron: An End-to-End Speech-to-Speech Conversion Model and its Applications to Hearing-Impaired Speech and Speech Separation,

F. Biadsy, R. J. Weiss, P. J. Moreno, D. Kanevsky, and Y. Jia, “Parrotron: An End-to-End Speech-to-Speech Conversion Model and its Applications to Hearing-Impaired Speech and Speech Separation,” in Proc. Interspeech 2019, 2019, pp. 4115–4119

[2]

Apple introduces new features for cognitive accessibility, along with live speech, personal voice, and point and speak in magnifier,

Apple, “Apple introduces new features for cognitive accessibility, along with live speech, personal voice, and point and speak in magnifier,” https://www.apple.com/newsroom/2023/05/apple-previews-live-speech-personal, 2023, accessed on 01/02/2024

[3]

Fraudsters cloned company director’s voice in $35 million bank heist, police find,

Forbes, “Fraudsters cloned company director’s voice in $35 million bank heist, police find,” https://www.forbes.com/sites/thomasbrewster/2021/10/14/huge-bank-fraud-uses-deep-fake-voice-tech-to-steal-millions, 2023, accessed on 01/02/2024

[4]

Asvspoof 2019: A large-scale public database of synthesized, converted and replayed speech,

X. Wang, J. Yamagishi, M. Todisco, H. Delgado, A. Nautsch, N. Evans, M. Sahidullah, V. Vestman, T. Kinnunen, K. A. Lee et al., “Asvspoof 2019: A large-scale public database of synthesized, converted and replayed speech,” Computer Speech & Language, vol. 64, p. 101114, 2020

[5]

Content authenticity initiative,

“Content authenticity initiative,” https://contentauthenticity.org/, (Accessed on 01/02/2024)

[6]

Does Audio Deepfake Detection Generalize?

N. Müller, P. Czempin, F. Dieckmann, A. Froghyar, and K. Böttinger, “Does Audio Deepfake Detection Generalize?” in Proc. Interspeech 2022, 2022, pp. 2783–2787

[7]

“ai putin” gave his chilling new year message, viewers convinced after ‘telltale sign’ spotted as death rumours swirl - the sun,

T. Sun, “‘ai putin’ gave his chilling new year message, viewers convinced after ‘telltale sign’ spotted as death rumours swirl - the sun,” https://www.thesun.co.uk/news/25220488/ai-putin-chilling-message-death-rumours/, (Accessed on 01/02/2024)

[8]

Deepfake elections: How indian politicians are using ai-manipulated media to malign,

O. India, “Deepfake elections: How indian politicians are using ai-manipulated media to malign,” https://business.outlookindia.com/technology/deepfake-elections-how-indian-politicians-are-using-ai, (Accessed on 01/02/2024)

[9]

Intelligence artificielle: quand un deepfake d’emmanuel macron emballe la toile,

L. Croix, “Intelligence artificielle: quand un deepfake d’emmanuel macron emballe la toile,” https://www.la-croix.com/France/Intelligence-artificielle-quand-deepfake-dEmmanuel-Macron, (Accessed on 01/02/2024)

[10]

The m-ailabs speech dataset,

T. M.-A. S. Dataset, “The m-ailabs speech dataset,” https://www.caito.de/2019/01/03/the-m-ailabs-speech-dataset/, 2023, accessed on 01/02/2024

[11]

Asvspoof 2015: the first automatic speaker verification spoofing and countermeasures challenge,

Z. Wu, T. Kinnunen, N. Evans, J. Yamagishi, C. Hanilçi, M. Sahidullah, and A. Sizov, “Asvspoof 2015: the first automatic speaker verification spoofing and countermeasures challenge,” in Sixteenth annual conference of the international speech communication association, 2015

[12]

ASVspoof 2021: accelerating progress in spoofed and deepfake speech detection,

J. Yamagishi, X. Wang, M. Todisco, M. Sahidullah, J. Patino, A. Nautsch, X. Liu, K. A. Lee, T. Kinnunen, N. Evans et al., “Asvspoof 2021: accelerating progress in spoofed and deepfake speech detection,” arXiv preprint arXiv:2109.00537, 2021

[13]

FakeAVCeleb: A novel audio-video multimodal deepfake dataset,

H. Khalid, S. Tariq, M. Kim, and S. S. Woo, “FakeAVCeleb: A novel audio-video multimodal deepfake dataset,” in Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2), 2021. [Online]. Available: https://openreview.net/forum?id=TAXFsg6ZaOl

[14]

For: A dataset for synthetic speech detection,

R. Reimao and V. Tzerpos, “For: A dataset for synthetic speech detection,” in 2019 International Conference on Speech Technology and Human-Computer Dialogue (SpeD). IEEE, 2019, pp. 1–10

[15]

Spoofed training data for speech spoofing countermeasure can be efficiently created using neural vocoders,

X. Wang and J. Yamagishi, “Spoofed training data for speech spoofing countermeasure can be efficiently created using neural vocoders,” in ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2023, pp. 1–5

[16]

The partialspoof database and countermeasures for the detection of short fake speech segments embedded in an utterance,

L. Zhang, X. Wang, E. Cooper, N. Evans, and J. Yamagishi, “The partialspoof database and countermeasures for the detection of short fake speech segments embedded in an utterance,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 31, pp. 813–825, 2023

[17]

WaveFake: A Data Set to Facilitate Audio Deepfake Detection,

J. Frank and L. Schönherr, “WaveFake: A Data Set to Facilitate Audio Deepfake Detection,” in Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2021

[18]

Add 2022: the first audio deep synthesis detection challenge,

J. Yi, R. Fu, J. Tao, S. Nie, H. Ma, C. Wang, T. Wang, Z. Tian, Y. Bai, C. Fan et al., “Add 2022: the first audio deep synthesis detection challenge,” in ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2022, pp. 9216–9220

[19]

Add 2023: the second audio deepfake detection challenge,

J. Yi, J. Tao, R. Fu, X. Yan, C. Wang, T. Wang, C. Y. Zhang, X. Zhang, Y. Zhao, Y. Ren et al., “Add 2023: the second audio deepfake detection challenge,” arXiv preprint arXiv:2305.13774, 2023

[20]

Fmfcc-a: a challenging mandarin dataset for synthetic speech detection,

Z. Zhang, Y. Gu, X. Yi, and X. Zhao, “Fmfcc-a: a challenging mandarin dataset for synthetic speech detection,” in International Workshop on Digital Watermarking. Springer, 2021, pp. 117–131

[21]

Half-Truth: A Partially Fake Audio Detection Dataset,

J. Yi, Y. Bai, J. Tao, H. Ma, Z. Tian, C. Wang, T. Wang, and R. Fu, “Half-Truth: A Partially Fake Audio Detection Dataset,” in Proc. Interspeech 2021, 2021, pp. 1654–1658

[22]

CFAD: A Chinese dataset for fake audio detection,

H. Ma, J. Yi, C. Wang, X. Yan, J. Tao, T. Wang, S. Wang, L. Xu, and R. Fu, “Fad: A chinese dataset for fake audio detection,” arXiv preprint arXiv:2207.12308, 2022

[23]

Natural tts synthesis by conditioning wavenet on mel spectrogram predictions,

J. Shen, R. Pang, R. J. Weiss, M. Schuster, N. Jaitly, Z. Yang, Z. Chen, Y. Zhang, Y. Wang, R. Skerry-Ryan et al., “Natural tts synthesis by conditioning wavenet on mel spectrogram predictions,” in 2018 IEEE international conference on acoustics, speech and signal processing (ICASSP). IEEE, 2018, pp. 4779–4783

[24]

SpeedySpeech: Efficient Neural Speech Synthesis,

J. Vainer and O. Dušek, “SpeedySpeech: Efficient Neural Speech Synthesis,” in Proc. Interspeech 2020, 2020, pp. 3575–3579

[25]

Better speech synthesis through scaling,

J. Betker, “Better speech synthesis through scaling,” arXiv preprint arXiv:2305.07243, 2023

[26]

Xtts: Open model release announcement,

Coqui.ai, “Xtts: Open model release announcement,” https://coqui.ai/blog/tts/open_xtts, (Accessed on 01/02/2024)

[27]

Conditional variational autoencoder with adversarial learning for end-to-end text-to-speech,

J. Kim, J. Kong, and J. Son, “Conditional variational autoencoder with adversarial learning for end-to-end text-to-speech,” in International Conference on Machine Learning. PMLR, 2021, pp. 5530–5540

[28]

Fastpitch: Parallel text-to-speech with pitch prediction,

A. Łańcucki, “Fastpitch: Parallel text-to-speech with pitch prediction,” in ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2021, pp. 6588–6592

[29]

Effective use of variational embedding capacity in expressive end-to-end speech synthesis,

E. Battenberg, S. Mariooryad, D. Stanton, R. Skerry-Ryan, M. Shannon, D. Kao, and T. Bagby, “Effective use of variational embedding capacity in expressive end-to-end speech synthesis,” 2019

[30]

YourTTS: Towards Zero-Shot Multi-Speaker TTS and Zero-Shot Voice Conversion for everyone,

E. Casanova, J. Weber, C. Shulby, A. C. Junior, E. Gölge, and M. Antonelli Ponti, “YourTTS: Towards Zero-Shot Multi-Speaker TTS and Zero-Shot Voice Conversion for everyone,” arXiv e-prints, p. arXiv:2112.02418, Dec. 2021

[31]

Glow-tts: A generative flow for text-to-speech via monotonic alignment search,

J. Kim, S. Kim, J. Kong, and S. Yoon, “Glow-tts: A generative flow for text-to-speech via monotonic alignment search,” Advances in Neural Information Processing Systems, vol. 33, pp. 8067–8077, 2020

[32]

OverFlow: Putting flows on top of neural transducers for better TTS,

S. Mehta, A. Kirkland, H. Lameris, J. Beskow, Éva Székely, and G. E. Henter, “OverFlow: Putting flows on top of neural transducers for better TTS,” in Proc. INTERSPEECH 2023, 2023, pp. 4279–4283

[33]

StarGANv2-VC: A Diverse, Unsupervised, Non-Parallel Framework for Natural-Sounding Voice Conversion,

Y. A. Li, A. Zare, and N. Mesgarani, “StarGANv2-VC: A Diverse, Unsupervised, Non-Parallel Framework for Natural-Sounding Voice Conversion,” in Proc. Interspeech 2021, 2021, pp. 1349–1353

[34]

Freevc: Towards high-quality text-free one-shot voice conversion,

J. Li, W. Tu, and L. Xiao, “Freevc: Towards high-quality text-free one-shot voice conversion,” in ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2023, pp. 1–5

[35]

ESPnet: End-to-end speech processing toolkit,

S. Watanabe, T. Hori, S. Karita, T. Hayashi, J. Nishitoba, Y. Unno, N. Enrique Yalta Soplin, J. Heymann, M. Wiesner, N. Chen, A. Renduchintala, and T. Ochiai, “ESPnet: End-to-end speech processing toolkit,” in Proceedings of Interspeech, 2018, pp. 2207–2211. [Online]. Available: http://dx.doi.org/10.21437/Interspeech.2018-1456

[36]

Coqui TTS,

G. Eren and The Coqui TTS Team, “Coqui TTS,” Jan. 2021. [Online]. Available: https://github.com/coqui-ai/TTS

[37]

SpeechBrain

M. Ravanelli, T. Parcollet, P. Plantinga, A. Rouhe, S. Cornell, L. Lugosch, C. Subakan, N. Dawalatabad, A. Heba, J. Zhong, J.-C. Chou, S.-L. Yeh, S.-W. Fu, C.-F. Liao, E. Rastorgueva, F. Grondin, W. Aris, H. Na, Y. Gao, R. De Mori, and Y. Bengio, “SpeechBrain.” [Online]. Available: https://github.com/speechbrain/speechbrain/

[38]

Transformers: State-of-the-Art Natural Language Processing

T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, C. Ma, Y. Jernite, J. Plu, C. Xu, T. Le Scao, S. Gugger, M. Drame, Q. Lhoest, and A. M. Rush, “Transformers: State-of-the-Art Natural Language Processing.” Association for Computational Linguistics, Oct. 2020, pp. 38–45. [Online]. Available: https://www.aclweb.org/anthology/2020.em...

[39]

The lj speech dataset,

K. Ito and L. Johnson, “The lj speech dataset,” https://keithito.com/LJ-Speech-Dataset/, 2017

[40]

The blizzard challenge 2013,

S. King and V. Karaiskos, “The blizzard challenge 2013,” 2014. [Online]. Available: https://api.semanticscholar.org/CorpusID:166265879

[41]

CSS10: A Collection of Single Speaker Speech Datasets for 10 Languages,

K. Park and T. Mulc, “CSS10: A Collection of Single Speaker Speech Datasets for 10 Languages,” in Proc. Interspeech 2019, 2019, pp. 1566–1570

[42]

Common voice: A massively-multilingual speech corpus,

R. Ardila, M. Branson, K. Davis, M. Kohler, J. Meyer, M. Henretty, R. Morais, L. Saunders, F. Tyers, and G. Weber, “Common voice: A massively-multilingual speech corpus,” in Proceedings of The 12th Language Resources and Evaluation Conference. Marseille, France: European Language Resources Association, May 2020, pp. 4218–4222. [Online]. Available: https:...

[43]

Thorsten-voice dataset 2022.10,

T. Müller and D. Kreutz, “Thorsten-voice dataset 2022.10,” October

[44]

Available: https://doi.org/10.5281/zenodo.7265581

[Online]. Available: https://doi.org/10.5281/zenodo.7265581

[45]

Github - dioco-group/jenny-tts-dataset: A high-quality, varied ~30hr voice dataset suitable for training a tts model,

“Github - dioco-group/jenny-tts-dataset: A high-quality, varied ~30hr voice dataset suitable for training a tts model,” https://github.com/dioco-group/jenny-tts-dataset, (Accessed on 01/02/2024)

[46]

End-to-end anti-spoofing with rawnet2,

H. Tak, J. Patino, M. Todisco, A. Nautsch, N. Evans, and A. Larcher, “End-to-end anti-spoofing with rawnet2,” in ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2021, pp. 6369–6373

[47]

End-to-end spectro-temporal graph attention networks for speaker verification anti-spoofing and speech deepfake detection,

H. Tak, J.-w. Jung, J. Patino, M. Kamble, M. Todisco, and N. Evans, “End-to-end spectro-temporal graph attention networks for speaker verification anti-spoofing and speech deepfake detection,” arXiv preprint arXiv:2107.12710, 2021

[48]

Aasist: Audio anti-spoofing using integrated spectro-temporal graph attention networks,

J.-w. Jung, H.-S. Heo, H. Tak, H.-j. Shim, J. S. Chung, B.-J. Lee, H.-J. Yu, and N. Evans, “Aasist: Audio anti-spoofing using integrated spectro-temporal graph attention networks,” in ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2022, pp. 6367–6371
