pith. machine review for the scientific record.

arxiv: 2603.02364 · v3 · submitted 2026-03-02 · 💻 cs.SD · eess.AS


When Spoof Detectors Travel: Evaluation Across 66 Languages in the Low-Resource Language Spoofing Corpus


Pith reviewed 2026-05-15 16:24 UTC · model grok-4.3

classification 💻 cs.SD eess.AS
keywords spoof detection · cross-lingual evaluation · synthetic speech · low-resource languages · domain shift · countermeasures · TTS systems · threshold transfer

The pith

Spoof detectors exhibit large performance gaps across 66 languages when thresholds are transferred from external data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper builds LRLspoof, a corpus of 2,732 hours of synthetic speech produced by 24 TTS systems in 66 languages, 45 of them low-resource. Eleven public countermeasures are tested by first fixing each model's decision threshold on pooled external benchmarks and then applying that same threshold to the new corpus. This yields a spoof rejection rate for each language without requiring any genuine speech from the target languages. The measurements reveal clear differences in how well each model rejects spoofs, with the size of the gap depending on both the model and the language. The pattern indicates that language introduces its own domain shift, separate from other acoustic factors.

Core claim

When 11 countermeasures are evaluated on the LRLspoof corpus using an EER operating point calibrated on external pooled data, spoof rejection rates vary markedly across the 66 languages in a model-dependent manner, demonstrating that language functions as an independent source of domain shift for spoof detection.

What carries the argument

Threshold transfer: calibrate an equal-error-rate operating point on pooled external benchmarks, then apply the resulting fixed threshold to compute spoof rejection rate on the new multilingual corpus without target-domain bonafide speech.
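
A minimal sketch of this protocol in Python/NumPy; the function names and the convention that higher scores mean more bonafide-like are our assumptions, not the paper's code:

    import numpy as np

    def eer_threshold(bonafide_scores, spoof_scores):
        # Scan candidate thresholds drawn from the pooled external scores and
        # pick the point where false acceptance (spoofs scored as bonafide)
        # balances false rejection (bonafide scored as spoof).
        cand = np.unique(np.concatenate([bonafide_scores, spoof_scores]))
        far = np.array([(spoof_scores >= t).mean() for t in cand])
        frr = np.array([(bonafide_scores < t).mean() for t in cand])
        return cand[np.argmin(np.abs(far - frr))]

    def spoof_rejection_rate(spoof_scores, threshold):
        # SRR needs no target-language bonafide speech: it is just the
        # fraction of synthetic utterances falling below the frozen threshold.
        return float((spoof_scores < threshold).mean())

    # Calibrate once on the pooled external benchmarks, then freeze:
    #   t = eer_threshold(external_bonafide, external_spoof)
    # and apply the same t to every language subset of the new corpus:
    #   srr_by_lang = {lang: spoof_rejection_rate(s, t)
    #                  for lang, s in per_language_spoof_scores.items()}

Per-language SRR then varies only through the spoof-score distributions, which is exactly why the protocol needs no target-domain bonafide data.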

If this is right

  • Spoof detectors trained or tuned on high-resource languages cannot be assumed to deliver uniform protection when deployed on low-resource languages.
  • Model rankings based on average performance across languages can mask large per-language failures that matter for global applications.
  • The LRLspoof corpus supplies a public benchmark that future countermeasures can use to measure cross-lingual robustness without collecting local bonafide data.
  • Language-specific evaluation protocols become necessary rather than optional for any spoof detector intended for multilingual use.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Voice-authentication systems intended for worldwide use may require per-language threshold adaptation or additional language-aware training to keep security levels consistent.
  • The observed disparity parallels domain-shift problems already documented in speech recognition and speaker verification when moving across languages or dialects.
  • Extending the evaluation to include real-world replay and voice-conversion attacks rather than only TTS output would test whether the language effect persists outside controlled synthesis.
  • Corpus expansion with matched bonafide recordings in a subset of the 66 languages would allow direct comparison of the threshold-transfer method against conventional EER measurement.

Load-bearing premise

A single threshold chosen on external benchmarks remains a valid and unbiased performance measure when applied to new languages that supply no genuine speech for recalibration.

What would settle it

Re-running the same threshold-transfer protocol on the 66-language corpus and finding nearly identical spoof rejection rates for multiple models across all languages would show that language does not act as an independent domain shift.

Figures

Figures reproduced from arXiv: 2603.02364 by Grach Mkrtchian, Kirill Borodin, Maxim Maslov, Mikhail Gorodnichev, Vasiliy Kudryavtsev.

Figure 1. Dataset duration distribution across languages (left) and TTS models (right).
Original abstract

We introduce LRLspoof, a large-scale multilingual synthetic-speech corpus for cross-lingual spoof detection, comprising 2,732 hours of audio generated with 24 open-source TTS systems across 66 languages, including 45 low-resource languages under our operational definition. To evaluate robustness without requiring target-domain bonafide speech, we benchmark 11 publicly available countermeasures using threshold transfer: for each model we calibrate an EER operating point on pooled external benchmarks and apply the resulting threshold, reporting spoof rejection rate (SRR). Results show model-dependent cross-lingual disparity, with spoof rejection varying markedly across languages even under controlled conditions, highlighting language as an independent source of domain shift in spoof detection. The dataset is publicly available at HuggingFace (https://huggingface.co/datasets/lab260/LRLspoof) and ModelScope (https://modelscope.cn/datasets/lab260/LRLspoof).

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces the LRLspoof corpus comprising 2,732 hours of synthetic speech from 24 open-source TTS systems across 66 languages (45 low-resource). It evaluates 11 public countermeasures via threshold transfer: an EER operating point is calibrated on pooled external benchmarks and applied to compute spoof rejection rate (SRR) on the new corpus, revealing model-dependent cross-lingual performance disparities that the authors attribute to language as an independent domain shift.

Significance. If the threshold-transfer protocol is shown to maintain a consistent operating point, the work supplies a large-scale public resource for cross-lingual spoof detection research and concrete evidence that language-induced score shifts can dominate detector behavior even under controlled synthesis conditions. The open release on Hugging Face and ModelScope is a clear asset for reproducibility.

major comments (2)
  1. [Evaluation protocol (threshold transfer procedure)] The central claim of model-dependent cross-lingual disparity in SRR rests on applying a single EER-derived threshold calibrated on external pooled benchmarks. Because no target-domain bonafide speech is available, the paper cannot verify that this threshold preserves a constant false-acceptance rate across languages; any language-specific shift in bonafide score statistics would move the effective decision boundary, so reported SRR differences may conflate changes in spoof and bonafide distributions.
  2. [Abstract and §4 (Experiments)] The manuscript supplies no details on model-selection criteria for the 11 countermeasures, the precise EER calibration procedure on the external benchmarks (e.g., which corpora, how scores are pooled, exact EER computation), or any controls for TTS quality variation across the 66 languages. These omissions limit independent verification of the disparity claim.
minor comments (1)
  1. [Abstract] The dataset citation links in the abstract are functional but could be repeated in the main text with DOIs or persistent identifiers for easier access.
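
To make the first major comment concrete, here is a toy simulation (all distributions invented for illustration) of how a frozen external threshold can stop being an EER operating point when bonafide score statistics shift in a new language, even though the spoof scores, and hence SRR, are untouched:

    import numpy as np

    rng = np.random.default_rng(0)

    # Invented score model: higher score = more bonafide-like.
    ext_bona  = rng.normal( 2.0, 1.0, 100_000)   # external bonafide pool
    ext_spoof = rng.normal(-2.0, 1.0, 100_000)   # external spoof pool

    # EER threshold on the external pool: FAR(t) == FRR(t) there.
    cand = np.linspace(-6.0, 6.0, 2001)
    far = np.array([(ext_spoof >= t).mean() for t in cand])
    frr = np.array([(ext_bona < t).mean() for t in cand])
    t_eer = cand[np.argmin(np.abs(far - frr))]

    # Hypothetical target language: spoof scores unchanged, but bonafide
    # scores sit one standard deviation lower (accent, channel, prosody).
    tgt_bona = rng.normal(1.0, 1.0, 100_000)

    print(f"SRR at frozen threshold:        {(ext_spoof < t_eer).mean():.3f}")
    print(f"FRR on external bonafide:       {(ext_bona < t_eer).mean():.3f}")
    print(f"FRR on shifted target bonafide: {(tgt_bona < t_eer).mean():.3f}")
    # SRR is computed from spoof scores alone, so it cannot distinguish a
    # real detection gap from a threshold that is simply no longer at the
    # EER operating point for this language.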

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and have revised the manuscript to improve clarity and completeness where possible.

Point-by-point responses
  1. Referee: [Evaluation protocol (threshold transfer procedure)] The central claim of model-dependent cross-lingual disparity in SRR rests on applying a single EER-derived threshold calibrated on external pooled benchmarks. Because no target-domain bonafide speech is available, the paper cannot verify that this threshold preserves a constant false-acceptance rate across languages; any language-specific shift in bonafide score statistics would move the effective decision boundary, so reported SRR differences may conflate changes in spoof and bonafide distributions.

    Authors: We agree that the lack of target-domain bonafide speech precludes direct confirmation that the transferred threshold maintains a constant false-acceptance rate across languages. The threshold-transfer protocol was deliberately selected to reflect realistic low-resource deployment conditions where bonafide calibration data are unavailable. SRR therefore measures the fraction of synthetic utterances rejected at the externally derived EER operating point, capturing the net effect of language-induced shifts on spoof scores. In the revised manuscript we have added a dedicated paragraph in §4.3 that explicitly discusses the possibility of bonafide score shifts as a confounding factor and states that the reported disparities reflect the combined influence of both score distributions on detection performance under transferred thresholds. We have also added this limitation to the conclusions. revision: partial

  2. Referee: [Abstract and §4 (Experiments)] The manuscript supplies no details on model-selection criteria for the 11 countermeasures, the precise EER calibration procedure on the external benchmarks (e.g., which corpora, how scores are pooled, exact EER computation), or any controls for TTS quality variation across the 66 languages. These omissions limit independent verification of the disparity claim.

    Authors: We thank the referee for highlighting these omissions. In the revised version we have expanded §3.2 to describe the model-selection criteria: the 11 countermeasures were chosen for public availability, architectural diversity (CNN, ResNet, and transformer variants), and documented performance on prior ASVspoof challenges. Section 4.1 now specifies the calibration procedure in full: scores from the ASVspoof 2019 LA and ASVspoof 2021 LA evaluation partitions were pooled, the EER threshold was computed as the operating point equating false-acceptance and false-rejection rates on the pooled score distribution, and the exact percentile-based implementation is provided. We have also added an appendix section that controls for TTS quality by reporting SRR stratified by synthesis system and by language-resource category, using available objective quality predictors. These additions enable independent reproduction and verification of the disparity results. revision: yes
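
The rebuttal's "exact percentile-based implementation" is not reproduced on this page; a common vectorized variant, which we assume is close in spirit, finds the crossing of the two error-rate curves over the sorted pooled scores:

    import numpy as np

    def eer_threshold_sorted(bona, spoof):
        # Sort once, then evaluate both error rates at every pooled score
        # via binary search; the EER threshold is where the curves cross.
        bona, spoof = np.sort(bona), np.sort(spoof)
        cand = np.unique(np.concatenate([bona, spoof]))
        frr = np.searchsorted(bona, cand, side="left") / bona.size          # P(bona < t)
        far = 1.0 - np.searchsorted(spoof, cand, side="left") / spoof.size  # P(spoof >= t)
        return float(cand[np.argmin(np.abs(far - frr))])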

Circularity Check

0 steps flagged

No circularity: threshold transfer uses independent external calibration

full rationale

The paper defines SRR by applying an EER threshold calibrated exclusively on pooled external benchmarks (independent of LRLspoof) to the new corpus's spoof scores. This produces a direct empirical measurement of rejection rates across languages rather than any self-referential fit, prediction from fitted parameters, or reduction to the target data by construction. The observed cross-lingual SRR disparities are reported outcomes of this fixed-threshold procedure; they do not loop back to redefine the threshold or the corpus itself. No self-citation chains, uniqueness theorems, or ansatzes are invoked as load-bearing steps in the derivation.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that external EER calibration transfers meaningfully to unseen languages and that the 24 TTS systems produce representative spoofs.

free parameters (1)
  • EER operating-point threshold
    Calibrated once on pooled external benchmarks and then frozen for all 66 languages.
axioms (1)
  • domain assumption: External EER threshold remains appropriate for new languages without target bonafide data
    Invoked to justify the threshold-transfer protocol described in the abstract.

pith-pipeline@v0.9.0 · 5498 in / 1229 out tokens · 65928 ms · 2026-05-15T16:24:59.684925+00:00 · methodology



Reference graph

Works this paper leans on

64 extracted references · 64 canonical work pages · 3 internal anchors

  1. [1]

    Introduction Recent progress in text-to-speech (TTS) and voice conversion (VC) has made audio spoofing increasingly practical, raising the stakes for speaker verification and other speech-driven security applications [1]. To keep evaluation reproducible and comparable as attacks evolve, the community has developed shared tasks and benchmarks with standa...

  2. [2]

    When Spoof Detectors Travel: Evaluation Across 66 Languages in the Low-Resource Language Spoofing Corpus

    Related works Table 1: Comparison of representative spoofing corpora. L = number of languages, LRL = number of low-resource languages (per our operational definition). “Models” denotes the number of distinct speech generation systems used to create audio as reported by each dataset. “Hours” denotes the reported duration of the speech data. Some entries a...

  3. [3]

    spans many languages and generation systems, but its low-resource coverage is smaller by the same criterion, which can constrain analyses centered on low-resource conditions. In contrast, our corpus is purpose-built for controlled cross-lingual spoof detection under explicit (language, synthesizer) shifts: we generate spoofed speech using a fixed suite ...

  4. [4]

    The corpus contains only synthetically generated speech produced with a fixed set of open-source TTS synthesizers across 66 languages

    Dataset Creation We constructed a multilingual synthetic-speech corpus for spoof detection research. The corpus contains only synthetically generated speech produced with a fixed set of open-source TTS synthesizers across 66 languages. We include widely used languages alongside many low-resource languages to facilitate controlled cross-lingual generaliz...

  5. [5]

    Experimental setup 4.1. Spoofing countermeasures We evaluate 11 publicly available spoofing CMs spanning classical spectro-temporal architectures and large self-supervised encoders: aasist3 [46], df arena 1b [47], df arena 500 [47], res2tcn [48], rescapsguard [48], sls [49], ssl aasist [50], tcm add [51], nes2net [52], w2v2 1b [53], and w2v2 300 [54]. 4.2. Spoof-only eva...

  6. [6]

    Results and Discussion 5.1. Overall robustness under threshold transfer We first summarize overall spoof rejection performance when transferring EER-calibrated thresholds from pooled external benchmarks to our corpus. Table 2 shows that threshold transfer can yield widely varying spoof rejection rates (SRR) across … Table 2: Spoof rejection rate (SRR, %) a...

  7. [7]

    Conclusion Using the proposed LRLspoof corpus, 2,732 hours of spoofed-only speech from 24 open-source TTS systems spanning 66 languages, we evaluated 11 public CMs at a fixed EER-calibrated operating point set on pooled external benchmarks, and tested them without adaptation across all language and synthesizer subsets. The results suggest that many C...

  8. [8]

    Generative AI Use Disclosure This work uses generative models as part of the data creation pipeline: portions of the dataset were synthesized using text-to-speech (TTS) systems to produce spoofed (synthetic) speech samples for anti-spoofing research. Generative AI tools were not used to develop the core scientific contributions beyond this disclosed data...

  9. [9]

    A Survey on Speech Deepfake Detection,

    M. Li, Y. Ahmadiadli, and X.-P. Zhang, “A Survey on Speech Deepfake Detection,” ACM Computing Surveys, vol. 57, no. 7, pp. 1–38, 2025

  10. [10]

    ASVspoof 2019: Future Horizons in Spoofed and Fake Audio Detection,

    M. Todisco, X. Wang, V. Vestman, M. Sahidullah, H. Delgado, A. Nautsch, J. Yamagishi, N. Evans, T. H. Kinnunen, and K. A. Lee, “ASVspoof 2019: Future Horizons in Spoofed and Fake Audio Detection,” in Interspeech 2019, 2019, pp. 1008–1012

  11. [11]

    Asvspoof 2021: Towards spoofed and deepfake speech detection in the wild,

    X. Liu, X. Wang, M. Sahidullah, J. Patino, H. Delgado, T. Kinnunen, M. Todisco, J. Yamagishi, N. Evans, A. Nautsch, and K. A. Lee, “Asvspoof 2021: Towards spoofed and deepfake speech detection in the wild,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 31, pp. 2507–2522, 2023

  12. [12]

    Asvspoof 5: Design, collection and validation of resources for spoofing, deepfake, and adversarial attack detection using crowdsourced speech,

    X. Wang, H. Delgado, H. Tak, J.-w. Jung, H.-j. Shim, M. Todisco, I. Kukanov, X. Liu, M. Sahidullah, T. Kinnunen, N. Evans, K. A. Lee, J. Yamagishi, M. Jeong, G. Zhu, Y. Zang, Y. Zhang, S. Maiti, F. Lux, N. Müller, W. Zhang, C. Sun, S. Hou, S. Lyu, S. Le Maguer, C. Gong, H. Guo, L. Chen, and V. Singh, “Asvspoof 5: Design, collection and validation o...

  13. [13]

    Add 2022: the first audio deep synthesis detection challenge,

    J. Yi, R. Fu, J. Tao, S. Nie, H. Ma, C. Wang, T. Wang, Z. Tian, Y. Bai, C. Fan, S. Liang, S. Wang, S. Zhang, X. Yan, L. Xu, Z. Wen, and H. Li, “Add 2022: the first audio deep synthesis detection challenge,” in ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2022, pp. 9216–9220

  14. [14]

    MLAAD: The Multi-Language Audio Anti-Spoofing Dataset,

    N. M. Müller, P. Kawa, W. H. Choong, E. Casanova, E. Gölge, T. Müller, P. Syga, P. Sperl, and K. Böttinger, “MLAAD: The Multi-Language Audio Anti-Spoofing Dataset,” in Proc. International Joint Conference on Neural Networks (IJCNN), 2024, pp. 1–7

  15. [15]

    IndicSynth: A Large-Scale Multilingual Synthetic Speech Dataset for Low-Resource Indian Languages,

    D. V. Sharma, V. Ekbote, and A. Gupta, “IndicSynth: A Large-Scale Multilingual Synthetic Speech Dataset for Low-Resource Indian Languages,” in Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2025, pp. 22037–22060

  16. [16]

    Ai-synthesized voice detection using neural vocoder artifacts,

    C. Sun, S. Jia, S. Hou, and S. Lyu, “Ai-synthesized voice detection using neural vocoder artifacts,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops. Computer Vision Foundation, 2023

  17. [17]

    SoundStream: An End-to-End Neural Audio Codec,

    N. Zeghidour, A. Luebs, A. Omran, J. Skoglund, and M. Tagliasacchi, “SoundStream: An End-to-End Neural Audio Codec,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 30, pp. 495–507, 2022

  18. [18]

    High Fidelity Neural Audio Compression

    A. Défossez, J. Copet, G. Synnaeve, and Y. Adi, “High Fidelity Neural Audio Compression,” Transactions on Machine Learning Research (TMLR), 2022. [Online]. Available: https://arxiv.org/abs/2210.13438

  19. [19]

    RawBoost: A Raw Data Boosting and Augmentation Method Applied to Automatic Speaker Verification Anti-Spoofing,

    H. Tak, M. Kamble, J. Patino, M. Todisco, and N. Evans, “RawBoost: A Raw Data Boosting and Augmentation Method Applied to Automatic Speaker Verification Anti-Spoofing,” in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2022

  20. [20]

    Adversarial Attacks on Spoofing Countermeasures of Automatic Speaker Verification,

    S. Liu, H. Wu, H.-y. Lee, and H. Meng, “Adversarial Attacks on Spoofing Countermeasures of Automatic Speaker Verification,” in Proc. IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), 2019, pp. 312–319

  21. [21]

    Towards Quantifying and Reducing Language Mismatch Effects in Cross-Lingual Speech Anti-Spoofing,

    T. Liu, I. Kukanov, Z. Pan, Q. Wang, H. B. Sailor, and K. A. Lee, “Towards Quantifying and Reducing Language Mismatch Effects in Cross-Lingual Speech Anti-Spoofing,” in Proc. IEEE Spoken Language Technology Workshop (SLT), 2024, pp. 1185–1192

  22. [22]

    Revealing Cross-Lingual Bias in Synthetic Speech Detection under Controlled Conditions,

    V. Moreno, J. Lima, F. Simões, R. Violato, M. Uliani Neto, F. Runstein, and P. Costa, “Revealing Cross-Lingual Bias in Synthetic Speech Detection under Controlled Conditions,” in Proc. 5th Symposium on Security and Privacy in Speech Communication (SPSC), 2025

  23. [23]

    Unmasking real-world audio deepfakes: A data-centric approach,

    D. Combei, A. Stan, D. Oneata, N. Müller, and H. Cucu, “Unmasking real-world audio deepfakes: A data-centric approach,” in Interspeech 2025, 2025, pp. 5343–5347

  24. [24]

    MLADDC: Multi-lingual audio deepfake detection corpus,

    A. J. Shah, R. M. Purohit, D. H. Vaghera, and H. Patil, “MLADDC: Multi-lingual audio deepfake detection corpus,” in Audio Imagination: NeurIPS 2024 Workshop AI-Driven Speech, Music, and Sound Generation, 2024. [Online]. Available: https://openreview.net/forum?id=ic3HvoOTeU

  25. [25]

    SynHate: Detecting Hate Speech in Synthetic Deepfake Audio,

    R. Ranjan, K. Pipariya, M. Vatsa, and R. Singh, “SynHate: Detecting Hate Speech in Synthetic Deepfake Audio,” in Proc. Interspeech, 2025, pp. 5623–5627

  26. [26]

    Multimodal Zero-Shot Framework for Deepfake Hate Speech Detection in Low-Resource Languages,

    R. Ranjan, L. Ayinala, M. Vatsa, and R. Singh, “Multimodal Zero-Shot Framework for Deepfake Hate Speech Detection in Low-Resource Languages,” in Interspeech 2025, 2025, pp. 1678–1682

  27. [27]

    SpeechFake: A large-scale multilingual speech deepfake dataset incorporating cutting-edge generation methods,

    W. Huang, Y. Gu, Z. Wang, H. Zhu, and Y. Qian, “SpeechFake: A large-scale multilingual speech deepfake dataset incorporating cutting-edge generation methods,” in Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Vienna, Austria: Association for Computational Linguistics, Jul. 2025, pp. 9985–9...

  28. [28]

    SafeEar: Content Privacy-Preserving Audio Deepfake Detection,

    X. Li, K. Li, Y. Zheng, C. Yan, X. Ji, and W. Xu, “SafeEar: Content Privacy-Preserving Audio Deepfake Detection,” in Proc. ACM CCS, 2024

  29. [29]

    Common voice: A massively-multilingual speech corpus,

    R. Ardila, M. Branson, K. Davis, M. Kohler, J. Meyer, M. Henretty, R. Morais, L. Saunders, F. Tyers, and G. Weber, “Common voice: A massively-multilingual speech corpus,” in Proceedings of the Twelfth Language Resources and Evaluation Conference. Marseille, France: European Language Resources Association, May 2020, pp. 4218–4222. [Online]. Available: http...

  30. [30]

    eSpeak NG: Open Source Speech Synthesizer,

    eSpeak NG contributors, “eSpeak NG: Open Source Speech Synthesizer,” https://github.com/espeak-ng/espeak-ng, accessed: 2026-01-12

  31. [31]

    RHVoice: a free and open-source speech synthesizer,

    RHVoice contributors, “RHVoice: a free and open-source speech synthesizer,” https://github.com/RHVoice/RHVoice, accessed: 2026-01-12

  32. [32]

    Aholab Signal Processing Laboratory, “AhoTTS,” https://github.com/aholab/AhoTTS, accessed: 2026-01-12

  33. [33]

    Silero Models: Text-to-Speech,

    Silero Team, “Silero Models: Text-to-Speech,” https://github.com/snakers4/silero-models, 2026, accessed: 2026-01-12

  34. [34]

    SpeechT5: Unified-Modal Encoder-Decoder Pre-Training for Spoken Language Processing,

    J. Ao, R. Wang, L. Zhou, C. Wang, S. Ren, Y. Wu, S. Liu, T. Ko, Q. Li, Y. Zhang, Z. Wei, Y. Qian, J. Li, and F. Wei, “SpeechT5: Unified-Modal Encoder-Decoder Pre-Training for Spoken Language Processing,” in Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2022, pp. 5723–5738

  35. [35]

    Fastpitch: Parallel text-to-speech with pitch prediction,

    A. Łańcucki, “Fastpitch: Parallel text-to-speech with pitch prediction,” in ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2021, pp. 6588–6592

  36. [36]

    Matcha-TTS: A Fast TTS Architecture with Conditional Flow Matching,

    S. Mehta, R. Tu, J. Beskow, É. Székely, and G. E. Henter, “Matcha-TTS: A Fast TTS Architecture with Conditional Flow Matching,” in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2024

  37. [37]

    Natural Language Guidance of High-Fidelity Text-to-Speech with Synthetic Annotations,

    D. Lyth and S. King, “Natural Language Guidance of High-Fidelity Text-to-Speech with Synthetic Annotations,” arXiv preprint arXiv:2402.01912, 2024

  38. [38]

    Piper: A Fast, Local Neural Text-to-Speech System,

    Rhasspy / Open Home Foundation Voice contributors, “Piper: A Fast, Local Neural Text-to-Speech System,” https://github.com/rhasspy/piper, 2026, accessed: 2026-01-12

  39. [39]

    MeloTTS: High-quality multi-lingual multi-accent text-to-speech,

    W. Zhao, X. Yu, and Z. Qin, “MeloTTS: High-quality multi-lingual multi-accent text-to-speech,” GitHub repository, 2023. [Online]. Available: https://github.com/myshell-ai/MeloTTS

  40. [40]

    Scaling Speech Technology to 1000+ Languages,

    V. Pratap, A. Tjandra, B. Shi, P. Tomasello, A. Babu, S. Kundu, A. Elkahky, Z. Ni, A. Vyas, M. Fazel-Zarandi, A. Baevski, Y. Adi, X. Zhang, W.-N. Hsu, A. Conneau, and M. Auli, “Scaling Speech Technology to 1000+ Languages,” Journal of Machine Learning Research, vol. 25, no. 97, pp. 1–52, 2024. [Online]. Available: https://jmlr.org/papers/v25/23-1318.html

  41. [41]

    Towards building text-to-speech systems for the next billion users,

    G. K. Kumar, P. S V, P. Kumar, M. M. Khapra, and K. Nandakumar, “Towards building text-to-speech systems for the next billion users,” in ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2023, pp. 1–5

  42. [42]

    Multilingual Text-to-Speech Synthesis for Turkic Languages Using Transliteration,

    R. Yeshpanov, S. Mussakhojayeva, and Y. Khassanov, “Multilingual Text-to-Speech Synthesis for Turkic Languages Using Transliteration,” in Interspeech 2023, 2023, pp. 5521–5525

  43. [43]

    The IMS Toucan system for the Blizzard Challenge 2021,

    F. Lux, J. Koch, A. Schweitzer, and N. Thang Vu, “The IMS Toucan system for the Blizzard Challenge 2021,” in The Blizzard Challenge 2021, 2021, pp. 14–19

  44. [44]

    QirimtatarTTS: Text-to-Speech for Crimean Tatar,

    Y. Paniv (robinhad), “QirimtatarTTS: Text-to-Speech for Crimean Tatar,” https://github.com/robinhad/qirimtatar-tts, accessed: 2026-01-12

  45. [45]

    XTTS: a Massively Multilingual Zero-Shot Text-to-Speech Model,

    E. Casanova, K. Davis, E. Gölge, G. Göknar, I. Gulea, L. Hart, A. Aljafari, J. Meyer, R. Morais, S. Olayemi, and J. Weber, “XTTS: a Massively Multilingual Zero-Shot Text-to-Speech Model,” in Interspeech 2024, 2024, pp. 4978–4982

  46. [46]

    XTTS-v2,

    Coqui developers, “XTTS-v2,” https://huggingface.co/coqui/XTTS-v2, 2026, accessed: 2026-01-12

  47. [47]

    OuteTTS,

    edwko developers, “OuteTTS,” https://github.com/edwko/OuteTTS, accessed: 2026-01-12

  48. [48]

    Chatterbox: Open-Source Text-to-Speech Models,

    Resemble AI developers, “Chatterbox: Open-Source Text-to-Speech Models,” https://github.com/resemble-ai/chatterbox, accessed: 2026-01-12

  49. [49]

    F5-tts: A fairytaler that fakes fluent and faithful speech with flow matching,

    Y. Chen, Z. Niu, Z. Ma, K. Deng, C. Wang, J. Zhao, K. Yu, and X. Chen, “F5-tts: A fairytaler that fakes fluent and faithful speech with flow matching,” in Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (ACL 2025). Association for Computational Linguistics, 2025. [Online]. Available: https://aclanthology.org/2025.ac...

  50. [50]

    CosyVoice: A Scalable Multilingual Zero-shot Text-to-speech Synthesizer based on Supervised Semantic Tokens,

    Z. Du, Q. Chen, S. Zhang, K. Hu, H. Lu, Y. Yang, H. Hu, S. Zheng, Y. Gu, Z. Ma, Z. Gao, and Z. Yan, “CosyVoice: A Scalable Multilingual Zero-shot Text-to-speech Synthesizer based on Supervised Semantic Tokens,” arXiv preprint arXiv:2407.05407, 2024

  51. [51]

    Zonos-v0.1,

    Zyphra developers, “Zonos-v0.1,” https://github.com/Zyphra/Zonos, accessed: 2026-01-12

  52. [52]

    Fish-Speech: Leveraging Large Language Models for Advanced Multilingual Text-to-Speech Synthesis,

    S. Liao, Y. Wang, T. Li, Y. Cheng, R. Zhang, R. Zhou, and Y. Xing, “Fish-Speech: Leveraging Large Language Models for Advanced Multilingual Text-to-Speech Synthesis,” arXiv preprint arXiv:2411.01156, 2024

  53. [53]

    Kokoro: Inference library for Kokoro-82M,

    hexgrad developers, “Kokoro: Inference library for Kokoro-82M,” https://github.com/hexgrad/kokoro, accessed: 2026-01-12

  54. [54]

    AASIST3: KAN-enhanced AASIST speech deepfake detection using SSL features and additional regularization for the ASVspoof 2024 Challenge,

    K. Borodin, V. Kudryavtsev, D. Korzh, A. Efimenko, G. Mkrtchian, M. Gorodnichev, and O. Y. Rogov, “AASIST3: KAN-enhanced AASIST speech deepfake detection using SSL features and additional regularization for the ASVspoof 2024 Challenge,” in The Automatic Speaker Verification Spoofing Countermeasures Workshop (ASVspoof 2024), 2024, pp. 48–55

  55. [55]

    Do compact ssl backbones matter for audio deepfake detection? a controlled study with raptor,

    A. Kulkarni, S. Dowerah, A. Kulkarni, T. Alumäe, and M. M. Doss, “Do compact ssl backbones matter for audio deepfake detection? a controlled study with raptor,” 2026. [Online]. Available: https://arxiv.org/abs/2603.06164

  56. [56]

    Capsule-based and tcn-based approaches for spoofing detection in voice biometry,

    K. Borodin, V. Kudryavtsev, G. Mkrtchian, and M. Gorodnichev, “Capsule-based and tcn-based approaches for spoofing detection in voice biometry,” Engineering, Technology & Applied Science Research, vol. 14, no. 6, pp. 18409–18414, 2024

  57. [57]

    Audio deepfake detection with self-supervised xls-r and sls classifier,

    Q. Zhang, S. Wen, and T. Hu, “Audio deepfake detection with self-supervised xls-r and sls classifier,” in Proceedings of the 32nd ACM International Conference on Multimedia, ser. MM ’24. New York, NY, USA: Association for Computing Machinery, 2024, pp. 6765–6773. [Online]. Available: https://doi.org/10.1145/3664647.3681345

  58. [58]

    Automatic speaker verification spoofing and deepfake detection using wav2vec 2.0 and data augmentation,

    H. Tak, M. Todisco, X. Wang, J.-w. Jung, J. Yamagishi, and N. Evans, “Automatic speaker verification spoofing and deepfake detection using wav2vec 2.0 and data augmentation,” in The Speaker and Language Recognition Workshop, 2022

  59. [59]

    Leveraging SSL Speech Features and Mamba for Enhanced DeepFake Detection,

    H. M. Tran, D. Lolive, D. Guennec, A. Sini, A. Delhay, and P.-F. Marteau, “Leveraging SSL Speech Features and Mamba for Enhanced DeepFake Detection,” in Interspeech 2025, 2025, pp. 5323–5327

  60. [60]

    Nes2net: A lightweight nested architecture for foundation model driven speech anti-spoofing,

    T. Liu, D.-T. Truong, R. Kumar Das, K. Aik Lee, and H. Li, “Nes2net: A lightweight nested architecture for foundation model driven speech anti-spoofing,” IEEE Transactions on Information Forensics and Security, vol. 20, pp. 12005–12018, 2025

  61. [61]

    wav2vec2-xls-r-1b DeepFake (AI4TRUST),

    D. Combei, “wav2vec2-xls-r-1b DeepFake (AI4TRUST),” https://huggingface.co/DavidCombei/wav2vec2-xls-r-1b-DeepFake-AI4TRUST, 2025, accessed: 2026-01-28

  62. [62]

    wav2vec2-xls-r-300m deepfake V1,

    D. Combei, “wav2vec2-xls-r-300m deepfake V1,” https://huggingface.co/DavidCombei/wav2vec2-xls-r-300m-deepfake-V1, 2025, accessed: 2026-01-28

  63. [63]

    Does Audio Deepfake Detection Generalize?

    N. Müller, P. Czempin, F. Diekmann, A. Froghyar, and K. Böttinger, “Does Audio Deepfake Detection Generalize?” in Interspeech 2022, 2022, pp. 2783–2787

  64. [64]

    DFADD: The Diffusion and Flow-Matching Based Audio Deepfake Dataset,

    J. Du, I.-M. Lin, I.-H. Chiu, X. Chen, H. Wu, W. Ren, Y. Tsao, H.-y. Lee, and J.-S. R. Jang, “DFADD: The Diffusion and Flow-Matching Based Audio Deepfake Dataset,” in Proc. IEEE Spoken Language Technology Workshop (SLT). IEEE, 2024, pp. 921–928