pith. machine review for the scientific record.

arxiv: 2603.02364 · v3 · submitted 2026-03-02 · 💻 cs.SD · eess.AS


When Spoof Detectors Travel: Evaluation Across 66 Languages in the Low-Resource Language Spoofing Corpus


Pith reviewed 2026-05-15 16:24 UTC · model grok-4.3

classification 💻 cs.SD eess.AS
keywords spoof detection · cross-lingual evaluation · synthetic speech · low-resource languages · domain shift · countermeasures · TTS systems · threshold transfer

The pith

Spoof detectors exhibit large performance gaps across 66 languages when thresholds are transferred from external data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper builds LRLspoof, a corpus of 2,732 hours of synthetic speech produced by 24 TTS systems in 66 languages, 45 of them low-resource. Eleven public countermeasures are tested by first fixing each model's decision threshold on pooled external benchmarks and then applying that same threshold to the new corpus. This yields a spoof rejection rate for each language without requiring any genuine speech from the target languages. The measurements reveal clear differences in how well each model rejects spoofs, with the size of the gap depending on both the model and the language. The pattern indicates that language introduces its own domain shift, separate from other acoustic factors.

Core claim

When 11 countermeasures are evaluated on the LRLspoof corpus using an EER operating point calibrated on external pooled data, spoof rejection rates vary markedly across the 66 languages in a model-dependent manner, demonstrating that language functions as an independent source of domain shift for spoof detection.

What carries the argument

Threshold transfer: calibrate an equal-error-rate operating point on pooled external benchmarks, then apply the resulting fixed threshold to compute spoof rejection rate on the new multilingual corpus without target-domain bonafide speech.
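
A minimal sketch of this protocol in Python/NumPy; the function names and the convention that higher scores mean more bonafide-like are our assumptions, not the paper's code:

    import numpy as np

    def eer_threshold(bonafide_scores, spoof_scores):
        # Scan candidate thresholds drawn from the pooled external scores and
        # pick the point where false acceptance (spoofs scored as bonafide)
        # balances false rejection (bonafide scored as spoof).
        cand = np.unique(np.concatenate([bonafide_scores, spoof_scores]))
        far = np.array([(spoof_scores >= t).mean() for t in cand])
        frr = np.array([(bonafide_scores < t).mean() for t in cand])
        return cand[np.argmin(np.abs(far - frr))]

    def spoof_rejection_rate(spoof_scores, threshold):
        # SRR needs no target-language bonafide speech: it is just the
        # fraction of synthetic utterances falling below the frozen threshold.
        return float((spoof_scores < threshold).mean())

    # Calibrate once on the pooled external benchmarks, then freeze:
    #   t = eer_threshold(external_bonafide, external_spoof)
    # and apply the same t to every language subset of the new corpus:
    #   srr_by_lang = {lang: spoof_rejection_rate(s, t)
    #                  for lang, s in per_language_spoof_scores.items()}

Per-language SRR then varies only through the spoof-score distributions, which is exactly why the protocol needs no target-domain bonafide data.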

If this is right

  • Spoof detectors trained or tuned on high-resource languages cannot be assumed to deliver uniform protection when deployed on low-resource languages.
  • Model rankings based on average performance across languages can mask large per-language failures that matter for global applications.
  • The LRLspoof corpus supplies a public benchmark that future countermeasures can use to measure cross-lingual robustness without collecting local bonafide data.
  • Language-specific evaluation protocols become necessary rather than optional for any spoof detector intended for multilingual use.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Voice-authentication systems intended for worldwide use may require per-language threshold adaptation or additional language-aware training to keep security levels consistent.
  • The observed disparity parallels domain-shift problems already documented in speech recognition and speaker verification when moving across languages or dialects.
  • Extending the evaluation to include real-world replay and voice-conversion attacks rather than only TTS output would test whether the language effect persists outside controlled synthesis.
  • Corpus expansion with matched bonafide recordings in a subset of the 66 languages would allow direct comparison of the threshold-transfer method against conventional EER measurement.

Load-bearing premise

A single threshold chosen on external benchmarks remains a valid and unbiased performance measure when applied to new languages that supply no genuine speech for recalibration.

What would settle it

Re-running the same threshold-transfer protocol on the 66-language corpus and finding nearly identical spoof rejection rates for multiple models across all languages would show that language does not act as an independent domain shift.

Figures

Figures reproduced from arXiv: 2603.02364 by Grach Mkrtchian, Kirill Borodin, Maxim Maslov, Mikhail Gorodnichev, Vasiliy Kudryavtsev.

Figure 1. Dataset duration distribution across languages (left) and TTS models (right).
Original abstract

We introduce LRLspoof, a large-scale multilingual synthetic-speech corpus for cross-lingual spoof detection, comprising 2,732 hours of audio generated with 24 open-source TTS systems across 66 languages, including 45 low-resource languages under our operational definition. To evaluate robustness without requiring target-domain bonafide speech, we benchmark 11 publicly available countermeasures using threshold transfer: for each model we calibrate an EER operating point on pooled external benchmarks and apply the resulting threshold, reporting spoof rejection rate (SRR). Results show model-dependent cross-lingual disparity, with spoof rejection varying markedly across languages even under controlled conditions, highlighting language as an independent source of domain shift in spoof detection. The dataset is publicly available at HuggingFace (https://huggingface.co/datasets/lab260/LRLspoof) and ModelScope (https://modelscope.cn/datasets/lab260/LRLspoof).

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces the LRLspoof corpus comprising 2,732 hours of synthetic speech from 24 open-source TTS systems across 66 languages (45 low-resource). It evaluates 11 public countermeasures via threshold transfer: an EER operating point is calibrated on pooled external benchmarks and applied to compute spoof rejection rate (SRR) on the new corpus, revealing model-dependent cross-lingual performance disparities that the authors attribute to language as an independent domain shift.

Significance. If the threshold-transfer protocol is shown to maintain a consistent operating point, the work supplies a large-scale public resource for cross-lingual spoof detection research and concrete evidence that language-induced score shifts can dominate detector behavior even under controlled synthesis conditions. The open release on Hugging Face and ModelScope is a clear asset for reproducibility.

major comments (2)
  1. [Evaluation protocol (threshold transfer procedure)] The central claim of model-dependent cross-lingual disparity in SRR rests on applying a single EER-derived threshold calibrated on external pooled benchmarks. Because no target-domain bonafide speech is available, the paper cannot verify that this threshold preserves a constant false-acceptance rate across languages; any language-specific shift in bonafide score statistics would move the effective decision boundary, so reported SRR differences may conflate changes in spoof and bonafide distributions.
  2. [Abstract and §4 (Experiments)] The manuscript supplies no details on model-selection criteria for the 11 countermeasures, the precise EER calibration procedure on the external benchmarks (e.g., which corpora, how scores are pooled, exact EER computation), or any controls for TTS quality variation across the 66 languages. These omissions limit independent verification of the disparity claim.
minor comments (1)
  1. [Abstract] The dataset citation links in the abstract are functional but could be repeated in the main text with DOIs or persistent identifiers for easier access.
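
To make the first major comment concrete, here is a toy simulation (all distributions invented for illustration) of how a frozen external threshold can stop being an EER operating point when bonafide score statistics shift in a new language, even though the spoof scores, and hence SRR, are untouched:

    import numpy as np

    rng = np.random.default_rng(0)

    # Invented score model: higher score = more bonafide-like.
    ext_bona  = rng.normal( 2.0, 1.0, 100_000)   # external bonafide pool
    ext_spoof = rng.normal(-2.0, 1.0, 100_000)   # external spoof pool

    # EER threshold on the external pool: FAR(t) == FRR(t) there.
    cand = np.linspace(-6.0, 6.0, 2001)
    far = np.array([(ext_spoof >= t).mean() for t in cand])
    frr = np.array([(ext_bona < t).mean() for t in cand])
    t_eer = cand[np.argmin(np.abs(far - frr))]

    # Hypothetical target language: spoof scores unchanged, but bonafide
    # scores sit one standard deviation lower (accent, channel, prosody).
    tgt_bona = rng.normal(1.0, 1.0, 100_000)

    print(f"SRR at frozen threshold:        {(ext_spoof < t_eer).mean():.3f}")
    print(f"FRR on external bonafide:       {(ext_bona < t_eer).mean():.3f}")
    print(f"FRR on shifted target bonafide: {(tgt_bona < t_eer).mean():.3f}")
    # SRR is computed from spoof scores alone, so it cannot distinguish a
    # real detection gap from a threshold that is simply no longer at the
    # EER operating point for this language.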

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and have revised the manuscript to improve clarity and completeness where possible.

Point-by-point responses
  1. Referee: [Evaluation protocol (threshold transfer procedure)] The central claim of model-dependent cross-lingual disparity in SRR rests on applying a single EER-derived threshold calibrated on external pooled benchmarks. Because no target-domain bonafide speech is available, the paper cannot verify that this threshold preserves a constant false-acceptance rate across languages; any language-specific shift in bonafide score statistics would move the effective decision boundary, so reported SRR differences may conflate changes in spoof and bonafide distributions.

    Authors: We agree that the lack of target-domain bonafide speech precludes direct confirmation that the transferred threshold maintains a constant false-acceptance rate across languages. The threshold-transfer protocol was deliberately selected to reflect realistic low-resource deployment conditions where bonafide calibration data are unavailable. SRR therefore measures the fraction of synthetic utterances rejected at the externally derived EER operating point, capturing the net effect of language-induced shifts on spoof scores. In the revised manuscript we have added a dedicated paragraph in §4.3 that explicitly discusses the possibility of bonafide score shifts as a confounding factor and states that the reported disparities reflect the combined influence of both score distributions on detection performance under transferred thresholds. We have also added this limitation to the conclusions. revision: partial

  2. Referee: [Abstract and §4 (Experiments)] The manuscript supplies no details on model-selection criteria for the 11 countermeasures, the precise EER calibration procedure on the external benchmarks (e.g., which corpora, how scores are pooled, exact EER computation), or any controls for TTS quality variation across the 66 languages. These omissions limit independent verification of the disparity claim.

    Authors: We thank the referee for highlighting these omissions. In the revised version we have expanded §3.2 to describe the model-selection criteria: the 11 countermeasures were chosen for public availability, architectural diversity (CNN, ResNet, and transformer variants), and documented performance on prior ASVspoof challenges. Section 4.1 now specifies the calibration procedure in full: scores from the ASVspoof 2019 LA and ASVspoof 2021 LA evaluation partitions were pooled, the EER threshold was computed as the operating point equating false-acceptance and false-rejection rates on the pooled score distribution, and the exact percentile-based implementation is provided. We have also added an appendix section that controls for TTS quality by reporting SRR stratified by synthesis system and by language-resource category, using available objective quality predictors. These additions enable independent reproduction and verification of the disparity results. revision: yes
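
The rebuttal's "exact percentile-based implementation" is not reproduced on this page; a common vectorized variant, which we assume is close in spirit, finds the crossing of the two error-rate curves over the sorted pooled scores:

    import numpy as np

    def eer_threshold_sorted(bona, spoof):
        # Sort once, then evaluate both error rates at every pooled score
        # via binary search; the EER threshold is where the curves cross.
        bona, spoof = np.sort(bona), np.sort(spoof)
        cand = np.unique(np.concatenate([bona, spoof]))
        frr = np.searchsorted(bona, cand, side="left") / bona.size          # P(bona < t)
        far = 1.0 - np.searchsorted(spoof, cand, side="left") / spoof.size  # P(spoof >= t)
        return float(cand[np.argmin(np.abs(far - frr))])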

Circularity Check

0 steps flagged

No circularity: threshold transfer uses independent external calibration

full rationale

The paper defines SRR by applying an EER threshold calibrated exclusively on pooled external benchmarks (independent of LRLspoof) to the new corpus's spoof scores. This produces a direct empirical measurement of rejection rates across languages rather than any self-referential fit, prediction from fitted parameters, or reduction to the target data by construction. The observed cross-lingual SRR disparities are reported outcomes of this fixed-threshold procedure; they do not loop back to redefine the threshold or the corpus itself. No self-citation chains, uniqueness theorems, or ansatzes are invoked as load-bearing steps in the derivation.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that external EER calibration transfers meaningfully to unseen languages and that the 24 TTS systems produce representative spoofs.

free parameters (1)
  • EER operating-point threshold
    Calibrated once on pooled external benchmarks and then frozen for all 66 languages.
axioms (1)
  • domain assumption: External EER threshold remains appropriate for new languages without target bonafide data
    Invoked to justify the threshold-transfer protocol described in the abstract.

pith-pipeline@v0.9.0 · 5498 in / 1229 out tokens · 65928 ms · 2026-05-15T16:24:59.684925+00:00 · methodology



Reference graph

Works this paper leans on

64 extracted references · 64 canonical work pages · 3 internal anchors

  1. [1]

    Introduction Recent progress in text-to-speech (TTS) and voice conversion (VC) has made audio spoofing increasingly practical, raising the stakes for speaker verification and other speech-driven security applications [1]. To keep evaluation reproducible and comparable as attacks evolve, the community has developed shared tasks and benchmarks with standa...

  2. [2]

    When Spoof Detectors Travel: Evaluation Across 66 Languages in the Low-Resource Language Spoofing Corpus

    Related works Table 1: Comparison of representative spoofing corpora. L = number of languages, LRL = number of low-resource languages (per our operational definition). “Models” denotes the number of distinct speech generation systems used to create audio as reported by each dataset. “Hours” denotes the reported duration of the speech data. Some entries a...

  3. [3]

    spans many languages and generation systems, but its low-resource coverage is smaller by the same criterion, which can constrain analyses centered on low-resource conditions. In contrast, our corpus is purpose-built for controlled cross-lingual spoof detection under explicit (language, synthesizer) shifts: we generate spoofed speech using a fixed suite ...

  4. [4]

    The corpus contains only synthetically generated speech produced with a fixed set of open-source TTS synthesizers across 66 languages

    Dataset Creation We constructed a multilingual synthetic-speech corpus for spoof detection research. The corpus contains only synthetically generated speech produced with a fixed set of open-source TTS synthesizers across 66 languages. We include widely used languages alongside many low-resource languages to facilitate controlled cross-lingual generaliz...

  5. [5]

    Experimental setup 4.1. Spoofing countermeasures We evaluate 11 publicly available spoofing CMs spanning classical spectro-temporal architectures and large self-supervised encoders: aasist3 [46], df arena 1b [47], df arena 500 [47], res2tcn [48], rescapsguard [48], sls [49], ssl aasist [50], tcm add [51], nes2net [52], w2v2 1b [53], and w2v2 300 [54]. 4.2. Spoof-only eva...

  6. [6]

    Results and Discussion 5.1. Overall robustness under threshold transfer We first summarize overall spoof rejection performance when transferring EER-calibrated thresholds from pooled external benchmarks to our corpus. Table 2 shows that threshold transfer can yield widely varying spoof rejection rates (SRR) across … Table 2: Spoof rejection rate (SRR, %) a...

  7. [7]

    Conclusion Using the proposed LRLspoof corpus, 2,732 hours of spoofed-only speech from 24 open-source TTS systems spanning 66 languages, we evaluated 11 public CMs at a fixed EER-calibrated operating point set on pooled external benchmarks, and tested them without adaptation across all language and synthesizer subsets. The results suggest that many C...

  8. [8]

    Generative AI Use Disclosure This work uses generative models as part of the data creation pipeline: portions of the dataset were synthesized using text-to-speech (TTS) systems to produce spoofed (synthetic) speech samples for anti-spoofing research. Generative AI tools were not used to develop the core scientific contributions beyond this disclosed data...

  9. [9]

    A Survey on Speech Deepfake Detection,

    M. Li, Y. Ahmadiadli, and X.-P. Zhang, “A Survey on Speech Deepfake Detection,” ACM Computing Surveys, vol. 57, no. 7, pp. 1–38, 2025

  10. [10]

    ASVspoof 2019: Future Horizons in Spoofed and Fake Audio Detection,

    M. Todisco, X. Wang, V. Vestman, M. Sahidullah, H. Delgado, A. Nautsch, J. Yamagishi, N. Evans, T. H. Kinnunen, and K. A. Lee, “ASVspoof 2019: Future Horizons in Spoofed and Fake Audio Detection,” in Interspeech 2019, 2019, pp. 1008–1012

  11. [11]

    Asvspoof 2021: Towards spoofed and deepfake speech detection in the wild,

    X. Liu, X. Wang, M. Sahidullah, J. Patino, H. Delgado, T. Kinnunen, M. Todisco, J. Yamagishi, N. Evans, A. Nautsch, and K. A. Lee, “Asvspoof 2021: Towards spoofed and deepfake speech detection in the wild,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 31, pp. 2507–2522, 2023

  12. [12]

    Asvspoof 5: Design, collection and validation of resources for spoofing, deepfake, and adversarial attack detection using crowdsourced speech,

    X. Wang, H. Delgado, H. Tak, J.-w. Jung, H.-j. Shim, M. Todisco, I. Kukanov, X. Liu, M. Sahidullah, T. Kinnunen, N. Evans, K. A. Lee, J. Yamagishi, M. Jeong, G. Zhu, Y. Zang, Y. Zhang, S. Maiti, F. Lux, N. Müller, W. Zhang, C. Sun, S. Hou, S. Lyu, S. Le Maguer, C. Gong, H. Guo, L. Chen, and V. Singh, “Asvspoof 5: Design, collection and validation o...

  13. [13]

    Add 2022: the first audio deep synthesis detection challenge,

    J. Yi, R. Fu, J. Tao, S. Nie, H. Ma, C. Wang, T. Wang, Z. Tian, Y. Bai, C. Fan, S. Liang, S. Wang, S. Zhang, X. Yan, L. Xu, Z. Wen, and H. Li, “Add 2022: the first audio deep synthesis detection challenge,” in ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2022, pp. 9216–9220

  14. [14]

    MLAAD: The Multi-Language Audio Anti-Spoofing Dataset,

    N. M. Müller, P. Kawa, W. H. Choong, E. Casanova, E. Gölge, T. Müller, P. Syga, P. Sperl, and K. Böttinger, “MLAAD: The Multi-Language Audio Anti-Spoofing Dataset,” in Proc. International Joint Conference on Neural Networks (IJCNN), 2024, pp. 1–7

  15. [15]

    IndicSynth: A Large-Scale Multilingual Synthetic Speech Dataset for Low-Resource Indian Languages,

    D. V. Sharma, V. Ekbote, and A. Gupta, “IndicSynth: A Large-Scale Multilingual Synthetic Speech Dataset for Low-Resource Indian Languages,” in Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2025, pp. 22037–22060

  16. [16]

    Ai-synthesized voice detection using neural vocoder artifacts,

    C. Sun, S. Jia, S. Hou, and S. Lyu, “Ai-synthesized voice detection using neural vocoder artifacts,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops. Computer Vision Foundation, 2023

  17. [17]

    SoundStream: An End-to-End Neural Audio Codec,

    N. Zeghidour, A. Luebs, A. Omran, J. Skoglund, and M. Tagliasacchi, “SoundStream: An End-to-End Neural Audio Codec,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 30, pp. 495–507, 2022

  18. [18]

    High Fidelity Neural Audio Compression

    A. Défossez, J. Copet, G. Synnaeve, and Y. Adi, “High Fidelity Neural Audio Compression,” Transactions on Machine Learning Research (TMLR), 2022. [Online]. Available: https://arxiv.org/abs/2210.13438

  19. [19]

    RawBoost: A Raw Data Boosting and Augmentation Method Applied to Automatic Speaker Verification Anti-Spoofing,

    H. Tak, M. Kamble, J. Patino, M. Todisco, and N. Evans, “RawBoost: A Raw Data Boosting and Augmentation Method Applied to Automatic Speaker Verification Anti-Spoofing,” in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2022

  20. [20]

    Adversarial Attacks on Spoofing Countermeasures of Automatic Speaker Verification,

    S. Liu, H. Wu, H.-y. Lee, and H. Meng, “Adversarial Attacks on Spoofing Countermeasures of Automatic Speaker Verification,” in Proc. IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), 2019, pp. 312–319

  21. [21]

    Towards Quantifying and Reducing Language Mismatch Effects in Cross-Lingual Speech Anti-Spoofing,

    T. Liu, I. Kukanov, Z. Pan, Q. Wang, H. B. Sailor, and K. A. Lee, “Towards Quantifying and Reducing Language Mismatch Effects in Cross-Lingual Speech Anti-Spoofing,” in Proc. IEEE Spoken Language Technology Workshop (SLT), 2024, pp. 1185–1192

  22. [22]

    Revealing Cross-Lingual Bias in Synthetic Speech Detection under Controlled Conditions,

    V. Moreno, J. Lima, F. Simões, R. Violato, M. Uliani Neto, F. Runstein, and P. Costa, “Revealing Cross-Lingual Bias in Synthetic Speech Detection under Controlled Conditions,” in Proc. 5th Symposium on Security and Privacy in Speech Communication (SPSC), 2025

  23. [23]

    Unmasking real-world audio deepfakes: A data-centric approach,

    D. Combei, A. Stan, D. Oneata, N. Müller, and H. Cucu, “Unmasking real-world audio deepfakes: A data-centric approach,” in Interspeech 2025, 2025, pp. 5343–5347

  24. [24]

    MLADDC: Multi-lingual audio deepfake detection corpus,

    A. J. Shah, R. M. Purohit, D. H. Vaghera, and H. Patil, “MLADDC: Multi-lingual audio deepfake detection corpus,” in Audio Imagination: NeurIPS 2024 Workshop AI-Driven Speech, Music, and Sound Generation, 2024. [Online]. Available: https://openreview.net/forum?id=ic3HvoOTeU

  25. [25]

    SynHate: Detecting Hate Speech in Synthetic Deepfake Audio,

    R. Ranjan, K. Pipariya, M. Vatsa, and R. Singh, “SynHate: Detecting Hate Speech in Synthetic Deepfake Audio,” in Proc. Interspeech, 2025, pp. 5623–5627

  26. [26]

    Multimodal Zero-Shot Framework for Deepfake Hate Speech Detection in Low-Resource Languages,

    R. Ranjan, L. Ayinala, M. Vatsa, and R. Singh, “Multimodal Zero-Shot Framework for Deepfake Hate Speech Detection in Low-Resource Languages,” in Interspeech 2025, 2025, pp. 1678–1682

  27. [27]

    SpeechFake: A large-scale multilingual speech deepfake dataset incorporating cutting-edge generation methods,

    W. Huang, Y. Gu, Z. Wang, H. Zhu, and Y. Qian, “SpeechFake: A large-scale multilingual speech deepfake dataset incorporating cutting-edge generation methods,” in Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Vienna, Austria: Association for Computational Linguistics, Jul. 2025, pp. 9985–9...

  28. [28]

    SafeEar: Content Privacy-Preserving Audio Deepfake Detection,

    X. Li, K. Li, Y. Zheng, C. Yan, X. Ji, and W. Xu, “SafeEar: Content Privacy-Preserving Audio Deepfake Detection,” in Proc. ACM CCS, 2024

  29. [29]

    Common voice: A massively-multilingual speech corpus,

    R. Ardila, M. Branson, K. Davis, M. Kohler, J. Meyer, M. Henretty, R. Morais, L. Saunders, F. Tyers, and G. Weber, “Common voice: A massively-multilingual speech corpus,” in Proceedings of the Twelfth Language Resources and Evaluation Conference. Marseille, France: European Language Resources Association, May 2020, pp. 4218–4222. [Online]. Available: http...

  30. [30]

    eSpeak NG: Open Source Speech Synthesizer,

    eSpeak NG contributors, “eSpeak NG: Open Source Speech Synthesizer,” https://github.com/espeak-ng/espeak-ng, accessed: 2026-01-12

  31. [31]

    RHVoice: a free and open-source speech synthesizer,

    RHVoice contributors, “RHVoice: a free and open-source speech synthesizer,” https://github.com/RHVoice/RHVoice, accessed: 2026-01-12

  32. [32]

    Aholab Signal Processing Laboratory, “AhoTTS,” https://github.com/aholab/AhoTTS, accessed: 2026-01-12

  33. [33]

    Silero Models: Text-to-Speech,

    Silero Team, “Silero Models: Text-to-Speech,” https://github.com/snakers4/silero-models, 2026, accessed: 2026-01-12

  34. [34]

    SpeechT5: Unified-Modal Encoder-Decoder Pre-Training for Spoken Language Processing,

    J. Ao, R. Wang, L. Zhou, C. Wang, S. Ren, Y. Wu, S. Liu, T. Ko, Q. Li, Y. Zhang, Z. Wei, Y. Qian, J. Li, and F. Wei, “SpeechT5: Unified-Modal Encoder-Decoder Pre-Training for Spoken Language Processing,” in Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2022, pp. 5723–5738

  35. [35]

    Fastpitch: Parallel text-to-speech with pitch prediction,

    A. Łańcucki, “Fastpitch: Parallel text-to-speech with pitch prediction,” in ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2021, pp. 6588–6592

  36. [36]

    Matcha-TTS: A Fast TTS Architecture with Conditional Flow Matching,

    S. Mehta, R. Tu, J. Beskow, É. Székely, and G. E. Henter, “Matcha-TTS: A Fast TTS Architecture with Conditional Flow Matching,” in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2024

  37. [37]

    Natural Language Guidance of High-Fidelity Text-to-Speech with Synthetic Annotations,

    D. Lyth and S. King, “Natural Language Guidance of High-Fidelity Text-to-Speech with Synthetic Annotations,” arXiv preprint arXiv:2402.01912, 2024

  38. [38]

    Piper: A Fast, Local Neural Text-to-Speech System,

    Rhasspy / Open Home Foundation Voice contributors, “Piper: A Fast, Local Neural Text-to-Speech System,” https://github.com/rhasspy/piper, 2026, accessed: 2026-01-12

  39. [39]

    MeloTTS: High-quality multi-lingual multi-accent text-to-speech,

    W. Zhao, X. Yu, and Z. Qin, “MeloTTS: High-quality multi-lingual multi-accent text-to-speech,” GitHub repository, 2023. [Online]. Available: https://github.com/myshell-ai/MeloTTS

  40. [40]

    Scaling Speech Technology to 1000+ Languages,

    V. Pratap, A. Tjandra, B. Shi, P. Tomasello, A. Babu, S. Kundu, A. Elkahky, Z. Ni, A. Vyas, M. Fazel-Zarandi, A. Baevski, Y. Adi, X. Zhang, W.-N. Hsu, A. Conneau, and M. Auli, “Scaling Speech Technology to 1000+ Languages,” Journal of Machine Learning Research, vol. 25, no. 97, pp. 1–52, 2024. [Online]. Available: https://jmlr.org/papers/v25/23-1318.html

  41. [41]

    Towards building text-to-speech systems for the next billion users,

    G. K. Kumar, P. S V, P. Kumar, M. M. Khapra, and K. Nandakumar, “Towards building text-to-speech systems for the next billion users,” in ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2023, pp. 1–5

  42. [42]

    Multilingual Text-to-Speech Synthesis for Turkic Languages Using Transliteration,

    R. Yeshpanov, S. Mussakhojayeva, and Y. Khassanov, “Multilingual Text-to-Speech Synthesis for Turkic Languages Using Transliteration,” in Interspeech 2023, 2023, pp. 5521–5525

  43. [43]

    The IMS Toucan system for the Blizzard Challenge 2021,

    F. Lux, J. Koch, A. Schweitzer, and N. Thang Vu, “The IMS Toucan system for the Blizzard Challenge 2021,” in The Blizzard Challenge 2021, 2021, pp. 14–19

  44. [44]

    QirimtatarTTS: Text-to-Speech for Crimean Tatar,

    Y. Paniv (robinhad), “QirimtatarTTS: Text-to-Speech for Crimean Tatar,” https://github.com/robinhad/qirimtatar-tts, accessed: 2026-01-12

  45. [45]

    XTTS: a Massively Multilingual Zero-Shot Text-to-Speech Model,

    E. Casanova, K. Davis, E. Gölge, G. Göknar, I. Gulea, L. Hart, A. Aljafari, J. Meyer, R. Morais, S. Olayemi, and J. Weber, “XTTS: a Massively Multilingual Zero-Shot Text-to-Speech Model,” in Interspeech 2024, 2024, pp. 4978–4982

  46. [46]

    XTTS-v2,

    Coqui developers, “XTTS-v2,” https://huggingface.co/coqui/XTTS-v2, 2026, accessed: 2026-01-12

  47. [47]

    OuteTTS,

    edwko developers, “OuteTTS,” https://github.com/edwko/OuteTTS, accessed: 2026-01-12

  48. [48]

    Chatterbox: Open-Source Text-to-Speech Models,

    Resemble AI developers, “Chatterbox: Open-Source Text-to-Speech Models,” https://github.com/resemble-ai/chatterbox, accessed: 2026-01-12

  49. [49]

    F5-tts: A fairytaler that fakes fluent and faithful speech with flow matching,

    Y. Chen, Z. Niu, Z. Ma, K. Deng, C. Wang, J. Zhao, K. Yu, and X. Chen, “F5-tts: A fairytaler that fakes fluent and faithful speech with flow matching,” in Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (ACL 2025). Association for Computational Linguistics, 2025. [Online]. Available: https://aclanthology.org/2025.ac...

  50. [50]

    CosyVoice: A Scalable Multilingual Zero-shot Text-to-speech Synthesizer based on Supervised Semantic Tokens,

    Z. Du, Q. Chen, S. Zhang, K. Hu, H. Lu, Y. Yang, H. Hu, S. Zheng, Y. Gu, Z. Ma, Z. Gao, and Z. Yan, “CosyVoice: A Scalable Multilingual Zero-shot Text-to-speech Synthesizer based on Supervised Semantic Tokens,” arXiv preprint arXiv:2407.05407, 2024

  51. [51]

    Zonos-v0.1,

    Zyphra developers, “Zonos-v0.1,” https://github.com/Zyphra/Zonos, accessed: 2026-01-12

  52. [52]

    Fish-Speech: Leveraging Large Language Models for Advanced Multilingual Text-to-Speech Synthesis,

    S. Liao, Y. Wang, T. Li, Y. Cheng, R. Zhang, R. Zhou, and Y. Xing, “Fish-Speech: Leveraging Large Language Models for Advanced Multilingual Text-to-Speech Synthesis,” arXiv preprint arXiv:2411.01156, 2024

  53. [53]

    Kokoro: Inference library for Kokoro-82M,

    hexgrad developers, “Kokoro: Inference library for Kokoro-82M,” https://github.com/hexgrad/kokoro, accessed: 2026-01-12

  54. [54]

    AASIST3: KAN-enhanced AASIST speech deepfake detection using SSL features and additional regularization for the ASVspoof 2024 Challenge,

    K. Borodin, V. Kudryavtsev, D. Korzh, A. Efimenko, G. Mkrtchian, M. Gorodnichev, and O. Y. Rogov, “AASIST3: KAN-enhanced AASIST speech deepfake detection using SSL features and additional regularization for the ASVspoof 2024 Challenge,” in The Automatic Speaker Verification Spoofing Countermeasures Workshop (ASVspoof 2024), 2024, pp. 48–55

  55. [55]

    Do compact ssl backbones matter for audio deepfake detection? a controlled study with raptor,

    A. Kulkarni, S. Dowerah, A. Kulkarni, T. Alumäe, and M. M. Doss, “Do compact ssl backbones matter for audio deepfake detection? a controlled study with raptor,” 2026. [Online]. Available: https://arxiv.org/abs/2603.06164

  56. [56]

    Capsule-based and tcn-based approaches for spoofing detection in voice biometry,

    K. Borodin, V. Kudryavtsev, G. Mkrtchian, and M. Gorodnichev, “Capsule-based and tcn-based approaches for spoofing detection in voice biometry,” Engineering, Technology & Applied Science Research, vol. 14, no. 6, pp. 18409–18414, 2024

  57. [57]

    Audio deepfake detection with self-supervised xls-r and sls classifier,

    Q. Zhang, S. Wen, and T. Hu, “Audio deepfake detection with self-supervised xls-r and sls classifier,” in Proceedings of the 32nd ACM International Conference on Multimedia, ser. MM ’24. New York, NY, USA: Association for Computing Machinery, 2024, pp. 6765–6773. [Online]. Available: https://doi.org/10.1145/3664647.3681345

  58. [58]

    Automatic speaker verification spoofing and deepfake detection using wav2vec 2.0 and data augmentation,

    H. Tak, M. Todisco, X. Wang, J.-w. Jung, J. Yamagishi, and N. Evans, “Automatic speaker verification spoofing and deepfake detection using wav2vec 2.0 and data augmentation,” in The Speaker and Language Recognition Workshop, 2022

  59. [59]

    Leveraging SSL Speech Features and Mamba for Enhanced DeepFake Detection,

    H. M. Tran, D. Lolive, D. Guennec, A. Sini, A. Delhay, and P.-F. Marteau, “Leveraging SSL Speech Features and Mamba for Enhanced DeepFake Detection,” in Interspeech 2025, 2025, pp. 5323–5327

  60. [60]

    Nes2net: A lightweight nested architecture for foundation model driven speech anti-spoofing,

    T. Liu, D.-T. Truong, R. Kumar Das, K. Aik Lee, and H. Li, “Nes2net: A lightweight nested architecture for foundation model driven speech anti-spoofing,” IEEE Transactions on Information Forensics and Security, vol. 20, pp. 12005–12018, 2025

  61. [61]

    wav2vec2-xls-r-1b DeepFake (AI4TRUST),

    D. Combei, “wav2vec2-xls-r-1b DeepFake (AI4TRUST),” https://huggingface.co/DavidCombei/wav2vec2-xls-r-1b-DeepFake-AI4TRUST, 2025, accessed: 2026-01-28

  62. [62]

    wav2vec2-xls-r-300m deepfake V1,

    D. Combei, “wav2vec2-xls-r-300m deepfake V1,” https://huggingface.co/DavidCombei/wav2vec2-xls-r-300m-deepfake-V1, 2025, accessed: 2026-01-28

  63. [63]

    Does Audio Deepfake Detection Generalize?

    N. Müller, P. Czempin, F. Diekmann, A. Froghyar, and K. Böttinger, “Does Audio Deepfake Detection Generalize?” in Interspeech 2022, 2022, pp. 2783–2787

  64. [64]

    DFADD: The Diffusion and Flow-Matching Based Audio Deepfake Dataset,

    J. Du, I.-M. Lin, I.-H. Chiu, X. Chen, H. Wu, W. Ren, Y. Tsao, H.-y. Lee, and J.-S. R. Jang, “DFADD: The Diffusion and Flow-Matching Based Audio Deepfake Dataset,” in Proc. IEEE Spoken Language Technology Workshop (SLT). IEEE, 2024, pp. 921–928