Recognition: 2 Lean theorem links
When Spoof Detectors Travel: Evaluation Across 66 Languages in the Low-Resource Language Spoofing Corpus
Pith reviewed 2026-05-15 16:24 UTC · model grok-4.3
The pith
Spoof detectors exhibit large performance gaps across 66 languages when thresholds are transferred from external data.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
When 11 countermeasures are evaluated on the LRLspoof corpus using an EER operating point calibrated on external pooled data, spoof rejection rates vary markedly across the 66 languages in a model-dependent manner, demonstrating that language functions as an independent source of domain shift for spoof detection.
What carries the argument
Threshold transfer: calibrate an equal-error-rate operating point on pooled external benchmarks, then apply the resulting fixed threshold to compute spoof rejection rate on the new multilingual corpus without target-domain bonafide speech.
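The threshold-transfer procedure can be sketched in a few lines (a minimal illustration with simulated Gaussian scores; the function names, language codes, and score distributions are our own, not the paper's):

```python
import numpy as np

def eer_threshold(bonafide_scores, spoof_scores):
    """Return the threshold where false-acceptance rate (spoof scores at or
    above the threshold) equals false-rejection rate (bonafide scores below
    it), scanning all observed scores as candidate thresholds."""
    bona = np.sort(bonafide_scores)
    spoof = np.sort(spoof_scores)
    candidates = np.sort(np.concatenate([bona, spoof]))
    # searchsorted(..., side="left") counts elements strictly below each candidate
    far = 1.0 - np.searchsorted(spoof, candidates, side="left") / spoof.size
    frr = np.searchsorted(bona, candidates, side="left") / bona.size
    return candidates[np.argmin(np.abs(far - frr))]

def spoof_rejection_rate(spoof_scores, threshold):
    """SRR: fraction of spoofed utterances scoring below the fixed threshold
    (convention: higher score = more bonafide-like)."""
    return float(np.mean(np.asarray(spoof_scores) < threshold))

# Calibrate once on pooled external benchmarks (simulated here), then
# transfer the fixed threshold to spoof-only per-language evaluation.
rng = np.random.default_rng(0)
ext_bonafide = rng.normal(2.0, 1.0, 5000)   # external genuine-speech scores
ext_spoof = rng.normal(-2.0, 1.0, 5000)     # external spoofed-speech scores
t = eer_threshold(ext_bonafide, ext_spoof)

# Hypothetical per-language spoof scores on the new corpus: a language whose
# spoofs score closer to the bonafide region is rejected far less often.
per_language_spoof = {
    "lang_a": rng.normal(-1.5, 1.0, 2000),
    "lang_b": rng.normal(0.5, 1.0, 2000),
}
srr = {lang: spoof_rejection_rate(s, t) for lang, s in per_language_spoof.items()}
```

No target-domain bonafide speech enters the second step; the threshold is frozen after external calibration, which is exactly what makes the per-language SRR spread interpretable as domain shift.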
If this is right
- Spoof detectors trained or tuned on high-resource languages cannot be assumed to deliver uniform protection when deployed on low-resource languages.
- Model rankings based on average performance across languages can mask large per-language failures that matter for global applications.
- The LRLspoof corpus supplies a public benchmark that future countermeasures can use to measure cross-lingual robustness without collecting local bonafide data.
- Language-specific evaluation protocols become necessary rather than optional for any spoof detector intended for multilingual use.
Where Pith is reading between the lines
- Voice-authentication systems intended for worldwide use may require per-language threshold adaptation or additional language-aware training to keep security levels consistent.
- The observed disparity parallels domain-shift problems already documented in speech recognition and speaker verification when moving across languages or dialects.
- Extending the evaluation to include real-world replay and voice-conversion attacks rather than only TTS output would test whether the language effect persists outside controlled synthesis.
- Corpus expansion with matched bonafide recordings in a subset of the 66 languages would allow direct comparison of the threshold-transfer method against conventional EER measurement.
Load-bearing premise
A single threshold chosen on external benchmarks remains a valid and unbiased performance measure when applied to new languages that supply no genuine speech for recalibration.
What would settle it
Re-running the same threshold-transfer protocol on the 66-language corpus and finding nearly identical spoof rejection rates for multiple models across all languages would show that language does not act as an independent domain shift.
Figures
read the original abstract
We introduce LRLspoof, a large-scale multilingual synthetic-speech corpus for cross-lingual spoof detection, comprising 2,732 hours of audio generated with 24 open-source TTS systems across 66 languages, including 45 low-resource languages under our operational definition. To evaluate robustness without requiring target-domain bonafide speech, we benchmark 11 publicly available countermeasures using threshold transfer: for each model we calibrate an EER operating point on pooled external benchmarks and apply the resulting threshold, reporting spoof rejection rate (SRR). Results show model-dependent cross-lingual disparity, with spoof rejection varying markedly across languages even under controlled conditions, highlighting language as an independent source of domain shift in spoof detection. The dataset is publicly available at HuggingFace (https://huggingface.co/datasets/lab260/LRLspoof) and ModelScope (https://modelscope.cn/datasets/lab260/LRLspoof).
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces the LRLspoof corpus comprising 2,732 hours of synthetic speech from 24 open-source TTS systems across 66 languages (45 low-resource). It evaluates 11 public countermeasures via threshold transfer: an EER operating point is calibrated on pooled external benchmarks and applied to compute spoof rejection rate (SRR) on the new corpus, revealing model-dependent cross-lingual performance disparities that the authors attribute to language as an independent domain shift.
Significance. If the threshold-transfer protocol is shown to maintain a consistent operating point, the work supplies both a large-scale public resource for cross-lingual spoof detection research and concrete evidence that language-induced score shifts can dominate detector behavior even under controlled synthesis conditions. The open release on Hugging Face and ModelScope is a clear asset for reproducibility.
major comments (2)
- [Evaluation protocol (threshold transfer procedure)] The central claim of model-dependent cross-lingual disparity in SRR rests on applying a single EER-derived threshold calibrated on external pooled benchmarks. Because no target-domain bonafide speech is available, the paper cannot verify that this threshold preserves a constant false-acceptance rate across languages; any language-specific shift in bonafide score statistics would move the effective decision boundary, so reported SRR differences may conflate changes in spoof and bonafide distributions.
- [Abstract and §4 (Experiments)] The manuscript supplies no details on model-selection criteria for the 11 countermeasures, the precise EER calibration procedure on the external benchmarks (e.g., which corpora, how scores are pooled, exact EER computation), or any controls for TTS quality variation across the 66 languages. These omissions limit independent verification of the disparity claim.
minor comments (1)
- [Abstract] The dataset citation links in the abstract are functional but could be repeated in the main text with DOIs or persistent identifiers for easier access.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment below and have revised the manuscript to improve clarity and completeness where possible.
read point-by-point responses
- Referee: [Evaluation protocol (threshold transfer procedure)] The central claim of model-dependent cross-lingual disparity in SRR rests on applying a single EER-derived threshold calibrated on external pooled benchmarks. Because no target-domain bonafide speech is available, the paper cannot verify that this threshold preserves a constant false-acceptance rate across languages; any language-specific shift in bonafide score statistics would move the effective decision boundary, so reported SRR differences may conflate changes in spoof and bonafide distributions.
Authors: We agree that the lack of target-domain bonafide speech precludes direct confirmation that the transferred threshold maintains a constant false-acceptance rate across languages. The threshold-transfer protocol was deliberately selected to reflect realistic low-resource deployment conditions where bonafide calibration data are unavailable. SRR therefore measures the fraction of synthetic utterances rejected at the externally derived EER operating point, capturing the net effect of language-induced shifts on spoof scores. In the revised manuscript we have added a dedicated paragraph in §4.3 that explicitly discusses the possibility of bonafide score shifts as a confounding factor and states that the reported disparities reflect the combined influence on detection performance under transferred thresholds. We have also added this limitation to the conclusions. revision: partial
- Referee: [Abstract and §4 (Experiments)] The manuscript supplies no details on model-selection criteria for the 11 countermeasures, the precise EER calibration procedure on the external benchmarks (e.g., which corpora, how scores are pooled, exact EER computation), or any controls for TTS quality variation across the 66 languages. These omissions limit independent verification of the disparity claim.
Authors: We thank the referee for highlighting these omissions. In the revised version we have expanded §3.2 to describe the model-selection criteria: the 11 countermeasures were chosen for public availability, architectural diversity (CNN, ResNet, and transformer variants), and documented performance on prior ASVspoof challenges. Section 4.1 now specifies the calibration procedure in full: scores from the ASVspoof 2019 LA and ASVspoof 2021 LA evaluation partitions were pooled, the EER threshold was computed as the operating point equating false-acceptance and false-rejection rates on the pooled score distribution, and the exact percentile-based implementation is provided. We have also added an appendix section that controls for TTS quality by reporting SRR stratified by synthesis system and by language-resource category, using available objective quality predictors. These additions enable independent reproduction and verification of the disparity results. revision: yes
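The bonafide-shift confound at issue in the first exchange can be made concrete with a short simulation (our own sketch; all distributions are invented): at a fixed transferred threshold, two languages can show the same spoof rejection rate while the unobserved false-rejection rate on genuine speech differs drastically.

```python
import numpy as np

rng = np.random.default_rng(1)
threshold = 0.0  # fixed operating point transferred from external calibration

# Identical spoof score distribution in two hypothetical languages, but the
# (unobserved) bonafide distribution is shifted in language B relative to
# the calibration domain. Convention: higher score = more bonafide-like.
spoof = rng.normal(-2.0, 1.0, 10_000)
bona_a = rng.normal(2.0, 1.0, 10_000)  # matches the calibration domain
bona_b = rng.normal(0.5, 1.0, 10_000)  # language-specific bonafide shift

srr = float(np.mean(spoof < threshold))      # same SRR for both languages
frr_a = float(np.mean(bona_a < threshold))   # roughly 2% genuine rejected
frr_b = float(np.mean(bona_b < threshold))   # roughly 31% genuine rejected
```

A spoof-only metric like SRR cannot detect the drift in `frr_b`, which is the limitation the authors acknowledge adding to §4.3 and the conclusions.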
Circularity Check
No circularity: threshold transfer uses independent external calibration
full rationale
The paper defines SRR by applying an EER threshold calibrated exclusively on pooled external benchmarks (independent of LRLspoof) to the new corpus's spoof scores. This produces a direct empirical measurement of rejection rates across languages rather than any self-referential fit, prediction from fitted parameters, or reduction to the target data by construction. The observed cross-lingual SRR disparities are reported outcomes of this fixed-threshold procedure; they do not loop back to redefine the threshold or the corpus itself. No self-citation chains, uniqueness theorems, or ansatzes are invoked as load-bearing steps in the derivation.
Axiom & Free-Parameter Ledger
free parameters (1)
- EER operating-point threshold
axioms (1)
- Domain assumption: the external EER threshold remains appropriate for new languages without target bonafide data.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · tag: unclear
Unclear: relation between the paper passage and the cited Recognition theorem.
Passage: "we calibrate an EER operating point on pooled external benchmarks and apply the resulting threshold, reporting spoof rejection rate (SRR)"
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · tag: unclear
Unclear: relation between the paper passage and the cited Recognition theorem.
Passage: "Results show model-dependent cross-lingual disparity... language as an independent source of domain shift"
What do these tags mean?
- matches: the paper's claim is directly supported by a theorem in the formal canon.
- supports: the theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: the paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: the paper appears to rely on the theorem as machinery.
- contradicts: the paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] Introduction: "Recent progress in text-to-speech (TTS) and voice conversion (VC) has made audio spoofing increasingly practical, raising the stakes for speaker verification and other speech-driven security applications [1]. To keep evaluation reproducible and comparable as attacks evolve, the community has developed shared tasks and benchmarks with standa..."
- [2] Related works: "Table 1: Comparison of representative spoofing corpora. L = number of languages, LRL = number of low-resource languages (per our operational definition). 'Models' denotes the number of distinct speech generation systems used to create audio as reported by each dataset. 'Hours' denotes the reported duration of the speech data. Some entries a..."
- [3] "spans many languages and generation systems, but its low-resource coverage is smaller by the same criterion, which can constrain analyses centered on low-resource conditions. In contrast, our corpus is purpose-built for controlled cross-lingual spoof detection under explicit (language, synthesizer) shifts: we generate spoofed speech using a fixed suite ..."
- [4] Dataset Creation: "We constructed a multilingual synthetic-speech corpus for spoof detection research. The corpus contains only synthetically generated speech produced with a fixed set of open-source TTS synthesizers across 66 languages. We include widely used languages alongside many low-resource languages to facilitate controlled cross-lingual generaliz..."
- [5] Experimental setup: "4.1. Spoofing countermeasures. We evaluate 11 publicly available spoofing CMs spanning classical spectro-temporal architectures and large self-supervised encoders: aasist3 [46], df arena 1b [47], df arena 500 [47], res2tcn [48], rescapsguard [48], sls [49], ssl aasist [50], tcm add [51], nes2net [52], w2v2 1b [53], and w2v2 300 [54]. 4.2. Spoof-only eva..."
- [6] Results and Discussion: "5.1. Overall robustness under threshold transfer. We first summarize overall spoof rejection performance when transferring EER-calibrated thresholds from pooled external benchmarks to our corpus. Table 2 shows that threshold transfer can yield widely varying spoof rejection rates (SRR) across ... Table 2: Spoof rejection rate (SRR, %) a..."
- [7] Conclusion: "Using the proposed LRLspoof corpus, 2,732 hours of spoofed-only speech from 24 open-source TTS systems spanning 66 languages, we evaluated 11 public CMs at a fixed EER-calibrated operating point set on pooled external benchmarks, and tested them without adaptation across all language and synthesizer subsets. The results suggest that many C..."
- [8] Generative AI Use Disclosure: "This work uses generative models as part of the data creation pipeline: portions of the dataset were synthesized using text-to-speech (TTS) systems to produce spoofed (synthetic) speech samples for anti-spoofing research. Generative AI tools were not used to develop the core scientific contributions beyond this disclosed data..."
- [9] M. Li, Y. Ahmadiadli, and X.-P. Zhang, "A Survey on Speech Deepfake Detection," ACM Computing Surveys, vol. 57, no. 7, pp. 1–38, 2025.
- [10] M. Todisco, X. Wang, V. Vestman, M. Sahidullah, H. Delgado, A. Nautsch, J. Yamagishi, N. Evans, T. H. Kinnunen, and K. A. Lee, "ASVspoof 2019: Future Horizons in Spoofed and Fake Audio Detection," in Interspeech 2019, 2019, pp. 1008–1012.
- [11] X. Liu, X. Wang, M. Sahidullah, J. Patino, H. Delgado, T. Kinnunen, M. Todisco, J. Yamagishi, N. Evans, A. Nautsch, and K. A. Lee, "ASVspoof 2021: Towards spoofed and deepfake speech detection in the wild," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 31, pp. 2507–2522, 2023.
- [12] X. Wang, H. Delgado, H. Tak, J.-w. Jung, H.-j. Shim, M. Todisco, I. Kukanov, X. Liu, M. Sahidullah, T. Kinnunen, N. Evans, K. A. Lee, J. Yamagishi, M. Jeong, G. Zhu, Y. Zang, Y. Zhang, S. Maiti, F. Lux, N. Müller, W. Zhang, C. Sun, S. Hou, S. Lyu, S. Le Maguer, C. Gong, H. Guo, L. Chen, and V. Singh, "ASVspoof 5: Design, collection and validation o..."
- [13] J. Yi, R. Fu, J. Tao, S. Nie, H. Ma, C. Wang, T. Wang, Z. Tian, Y. Bai, C. Fan, S. Liang, S. Wang, S. Zhang, X. Yan, L. Xu, Z. Wen, and H. Li, "ADD 2022: the first audio deep synthesis detection challenge," in ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2022, pp. 9216–9220.
- [14] N. M. Müller, P. Kawa, W. H. Choong, E. Casanova, E. Gölge, T. Müller, P. Syga, P. Sperl, and K. Böttinger, "MLAAD: The Multi-Language Audio Anti-Spoofing Dataset," in Proc. International Joint Conference on Neural Networks (IJCNN), 2024, pp. 1–7.
- [15] D. V. Sharma, V. Ekbote, and A. Gupta, "IndicSynth: A Large-Scale Multilingual Synthetic Speech Dataset for Low-Resource Indian Languages," in Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2025, pp. 22037–22060.
- [16] C. Sun, S. Jia, S. Hou, and S. Lyu, "AI-synthesized voice detection using neural vocoder artifacts," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2023.
- [17] N. Zeghidour, A. Luebs, A. Omran, J. Skoglund, and M. Tagliasacchi, "SoundStream: An End-to-End Neural Audio Codec," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 30, pp. 495–507, 2022.
- [18] A. Défossez, J. Copet, G. Synnaeve, and Y. Adi, "High Fidelity Neural Audio Compression," Transactions on Machine Learning Research (TMLR), 2022. Available: https://arxiv.org/abs/2210.13438
- [19] H. Tak, M. Kamble, J. Patino, M. Todisco, and N. Evans, "RawBoost: A Raw Data Boosting and Augmentation Method Applied to Automatic Speaker Verification Anti-Spoofing," in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2022.
- [20] S. Liu, H. Wu, H.-y. Lee, and H. Meng, "Adversarial Attacks on Spoofing Countermeasures of Automatic Speaker Verification," in Proc. IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), 2019, pp. 312–319.
- [21] T. Liu, I. Kukanov, Z. Pan, Q. Wang, H. B. Sailor, and K. A. Lee, "Towards Quantifying and Reducing Language Mismatch Effects in Cross-Lingual Speech Anti-Spoofing," in Proc. IEEE Spoken Language Technology Workshop (SLT), 2024, pp. 1185–1192.
- [22] V. Moreno, J. Lima, F. Simões, R. Violato, M. Uliani Neto, F. Runstein, and P. Costa, "Revealing Cross-Lingual Bias in Synthetic Speech Detection under Controlled Conditions," in Proc. 5th Symposium on Security and Privacy in Speech Communication (SPSC), 2025.
- [23] D. Combei, A. Stan, D. Oneata, N. Müller, and H. Cucu, "Unmasking real-world audio deepfakes: A data-centric approach," in Interspeech 2025, 2025, pp. 5343–5347.
- [24] A. J. Shah, R. M. Purohit, D. H. Vaghera, and H. Patil, "MLADDC: Multi-lingual audio deepfake detection corpus," in Audio Imagination: NeurIPS 2024 Workshop AI-Driven Speech, Music, and Sound Generation, 2024. Available: https://openreview.net/forum?id=ic3HvoOTeU
- [25] R. Ranjan, K. Pipariya, M. Vatsa, and R. Singh, "SynHate: Detecting Hate Speech in Synthetic Deepfake Audio," in Proc. Interspeech, 2025, pp. 5623–5627.
- [26] R. Ranjan, L. Ayinala, M. Vatsa, and R. Singh, "Multimodal Zero-Shot Framework for Deepfake Hate Speech Detection in Low-Resource Languages," in Interspeech 2025, 2025, pp. 1678–1682.
- [27] W. Huang, Y. Gu, Z. Wang, H. Zhu, and Y. Qian, "SpeechFake: A large-scale multilingual speech deepfake dataset incorporating cutting-edge generation methods," in Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vienna, Austria, Jul. 2025, pp. 9985–9...
- [28] X. Li, K. Li, Y. Zheng, C. Yan, X. Ji, and W. Xu, "SafeEar: Content Privacy-Preserving Audio Deepfake Detection," in Proc. ACM CCS, 2024.
- [29] R. Ardila, M. Branson, K. Davis, M. Kohler, J. Meyer, M. Henretty, R. Morais, L. Saunders, F. Tyers, and G. Weber, "Common voice: A massively-multilingual speech corpus," in Proceedings of the Twelfth Language Resources and Evaluation Conference, Marseille, France, May 2020, pp. 4218–4222.
- [30] eSpeak NG contributors, "eSpeak NG: Open Source Speech Synthesizer," https://github.com/espeak-ng/espeak-ng, accessed 2026-01-12.
- [31] RHVoice contributors, "RHVoice: a free and open-source speech synthesizer," https://github.com/RHVoice/RHVoice, accessed 2026-01-12.
- [32] Aholab Signal Processing Laboratory, "AhoTTS," https://github.com/aholab/AhoTTS, accessed 2026-01-12.
- [33] Silero Team, "Silero Models: Text-to-Speech," https://github.com/snakers4/silero-models, 2026, accessed 2026-01-12.
- [34] J. Ao, R. Wang, L. Zhou, C. Wang, S. Ren, Y. Wu, S. Liu, T. Ko, Q. Li, Y. Zhang, Z. Wei, Y. Qian, J. Li, and F. Wei, "SpeechT5: Unified-Modal Encoder-Decoder Pre-Training for Spoken Language Processing," in Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2022, pp. 5723–5738.
- [35] A. Łańcucki, "Fastpitch: Parallel text-to-speech with pitch prediction," in ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2021, pp. 6588–6592.
- [36] S. Mehta, R. Tu, J. Beskow, É. Székely, and G. E. Henter, "Matcha-TTS: A Fast TTS Architecture with Conditional Flow Matching," in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2024.
- [37] D. Lyth and S. King, "Natural Language Guidance of High-Fidelity Text-to-Speech with Synthetic Annotations," arXiv preprint arXiv:2402.01912, 2024.
- [38] Rhasspy / Open Home Foundation Voice contributors, "Piper: A Fast, Local Neural Text-to-Speech System," https://github.com/rhasspy/piper, 2026, accessed 2026-01-12.
- [39] W. Zhao, X. Yu, and Z. Qin, "MeloTTS: High-quality multi-lingual multi-accent text-to-speech," GitHub repository, 2023. Available: https://github.com/myshell-ai/MeloTTS
- [40] V. Pratap, A. Tjandra, B. Shi, P. Tomasello, A. Babu, S. Kundu, A. Elkahky, Z. Ni, A. Vyas, M. Fazel-Zarandi, A. Baevski, Y. Adi, X. Zhang, W.-N. Hsu, A. Conneau, and M. Auli, "Scaling Speech Technology to 1000+ Languages," Journal of Machine Learning Research, vol. 25, no. 97, pp. 1–52, 2024. Available: https://jmlr.org/papers/v25/23-1318.html
- [41] G. K. Kumar, P. S V, P. Kumar, M. M. Khapra, and K. Nandakumar, "Towards building text-to-speech systems for the next billion users," in ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2023, pp. 1–5.
- [42] R. Yeshpanov, S. Mussakhojayeva, and Y. Khassanov, "Multilingual Text-to-Speech Synthesis for Turkic Languages Using Transliteration," in Interspeech 2023, 2023, pp. 5521–5525.
- [43] F. Lux, J. Koch, A. Schweitzer, and N. Thang Vu, "The IMS Toucan system for the Blizzard Challenge 2021," in The Blizzard Challenge 2021, 2021, pp. 14–19.
- [44] Y. Paniv (robinhad), "QirimtatarTTS: Text-to-Speech for Crimean Tatar," https://github.com/robinhad/qirimtatar-tts, accessed 2026-01-12.
- [45] E. Casanova, K. Davis, E. Gölge, G. Göknar, I. Gulea, L. Hart, A. Aljafari, J. Meyer, R. Morais, S. Olayemi, and J. Weber, "XTTS: a Massively Multilingual Zero-Shot Text-to-Speech Model," in Interspeech 2024, 2024, pp. 4978–4982.
- [46]
- [47]
- [48] Resemble AI developers, "Chatterbox: Open-Source Text-to-Speech Models," https://github.com/resemble-ai/chatterbox, accessed 2026-01-12.
- [49] Y. Chen, Z. Niu, Z. Ma, K. Deng, C. Wang, J. Zhao, K. Yu, and X. Chen, "F5-TTS: A fairytaler that fakes fluent and faithful speech with flow matching," in Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (ACL 2025), 2025. Available: https://aclanthology.org/2025.ac...
- [50] Z. Du, Q. Chen, S. Zhang, K. Hu, H. Lu, Y. Yang, H. Hu, S. Zheng, Y. Gu, Z. Ma, Z. Gao, and Z. Yan, "CosyVoice: A Scalable Multilingual Zero-shot Text-to-speech Synthesizer based on Supervised Semantic Tokens," arXiv preprint arXiv:2407.05407, 2024.
- [51] Zyphra developers, "Zonos-v0.1," https://github.com/Zyphra/Zonos, accessed 2026-01-12.
- [52] S. Liao, Y. Wang, T. Li, Y. Cheng, R. Zhang, R. Zhou, and Y. Xing, "Fish-Speech: Leveraging Large Language Models for Advanced Multilingual Text-to-Speech Synthesis," arXiv preprint arXiv:2411.01156, 2024.
- [53] hexgrad developers, "Kokoro: Inference library for Kokoro-82M," https://github.com/hexgrad/kokoro, accessed 2026-01-12.
- [54] K. Borodin, V. Kudryavtsev, D. Korzh, A. Efimenko, G. Mkrtchian, M. Gorodnichev, and O. Y. Rogov, "AASIST3: KAN-enhanced AASIST speech deepfake detection using SSL features and additional regularization for the ASVspoof 2024 Challenge," in The Automatic Speaker Verification Spoofing Countermeasures Workshop (ASVspoof 2024), 2024, pp. 48–55.
- [55] A. Kulkarni, S. Dowerah, A. Kulkarni, T. Alumäe, and M. M. Doss, "Do compact SSL backbones matter for audio deepfake detection? A controlled study with RAPTOR," 2026. Available: https://arxiv.org/abs/2603.06164
- [56] K. Borodin, V. Kudryavtsev, G. Mkrtchian, and M. Gorodnichev, "Capsule-based and TCN-based approaches for spoofing detection in voice biometry," Engineering, Technology & Applied Science Research, vol. 14, no. 6, pp. 18409–18414, 2024.
- [57] Q. Zhang, S. Wen, and T. Hu, "Audio deepfake detection with self-supervised XLS-R and SLS classifier," in Proceedings of the 32nd ACM International Conference on Multimedia (MM '24), 2024, pp. 6765–6773. Available: https://doi.org/10.1145/3664647.3681345
- [58] H. Tak, M. Todisco, X. Wang, J.-w. Jung, J. Yamagishi, and N. Evans, "Automatic speaker verification spoofing and deepfake detection using wav2vec 2.0 and data augmentation," in The Speaker and Language Recognition Workshop, 2022.
- [59] H. M. Tran, D. Lolive, D. Guennec, A. Sini, A. Delhay, and P.-F. Marteau, "Leveraging SSL Speech Features and Mamba for Enhanced DeepFake Detection," in Interspeech 2025, 2025, pp. 5323–5327.
- [60] T. Liu, D.-T. Truong, R. Kumar Das, K. Aik Lee, and H. Li, "Nes2net: A lightweight nested architecture for foundation model driven speech anti-spoofing," IEEE Transactions on Information Forensics and Security, vol. 20, pp. 12005–12018, 2025.
- [61] D. Combei, "wav2vec2-xls-r-1b DeepFake (AI4TRUST)," https://huggingface.co/DavidCombei/wav2vec2-xls-r-1b-DeepFake-AI4TRUST, 2025, accessed 2026-01-28.
- [62] D. Combei, "wav2vec2-xls-r-300m deepfake V1," https://huggingface.co/DavidCombei/wav2vec2-xls-r-300m-deepfake-V1, 2025, accessed 2026-01-28.
- [63] N. Müller, P. Czempin, F. Diekmann, A. Froghyar, and K. Böttinger, "Does Audio Deepfake Detection Generalize?" in Interspeech 2022, 2022, pp. 2783–2787.
- [64] J. Du, I.-M. Lin, I.-H. Chiu, X. Chen, H. Wu, W. Ren, Y. Tsao, H.-y. Lee, and J.-S. R. Jang, "DFADD: The Diffusion and Flow-Matching Based Audio Deepfake Dataset," in Proc. IEEE Spoken Language Technology Workshop (SLT), 2024, pp. 921–928.