Evaluating Generalization and Robustness in Russian Anti-Spoofing: The RuASD Initiative
Pith reviewed 2026-05-08 02:25 UTC · model gemini-3-flash-preview
The pith
A new benchmark tests whether AI can spot fake Russian speech in the real world.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors demonstrate that current anti-spoofing models struggle significantly when faced with 'unseen' synthesis techniques—those not present during the model's training—and when audio is passed through realistic communication channels like codecs or reverb. They find that while large, pre-trained models generally perform better, no single architecture is currently immune to the performance drops caused by common audio distortions. The RuASD dataset serves as a diagnostic tool to identify these specific failure points in Russian-language security applications.
What carries the argument
The RuASD (Russian Anti-Spoofing Dataset) pipeline, which integrates 37 distinct text-to-speech and voice-cloning systems with a 'perturbation chain' that simulates real-world audio degradation like room echo and digital compression.
If this is right
- Security developers can now measure exactly how much their voice-authentication systems degrade when a user calls from a noisy environment.
- The high diversity of synthesis systems sets a new standard for what generalization means in voice-spoofing detection.
- Future detection models will likely need to incorporate channel-aware training to remain effective against modern digital transmission.
- Researchers can compare the effectiveness of lightweight models versus large-scale pre-trained systems on a level playing field for the Russian language.
Where Pith is reading between the lines
- The gap between laboratory performance and real-world deployment suggests that current performance claims in voice security may be significantly overoptimistic.
- The methodology used here for Russian could be adapted to create similar stress-test benchmarks for lower-resource languages where voice-cloning threats are emerging.
- If the simulated distortions in the dataset become a standard training target, attackers may shift toward high-fidelity spoofs that exploit the detection model's over-reliance on identifying channel noise.
Load-bearing premise
The simulated distortions, such as added noise and digital compression, are assumed to be representative of the messy audio quality found in actual fraudulent phone calls.
What would settle it
If a detection model that performs perfectly on the RuASD benchmark consistently fails to catch actual fraudulent Russian voice-clones in a live banking or security environment, then the benchmark's simulated conditions are not sufficiently representative of real-world attacks.
Original abstract
RuASD (Russian AntiSpoofing Dataset) is a dedicated, reproducible benchmark for Russian-language speech anti-spoofing designed to evaluate both in-domain discrimination and robustness to deployment-style distribution shifts. It combines a large spoof subset synthesized using 37 modern Russian-capable TTS and voice-cloning systems with a bona fide subset curated from multiple heterogeneous open Russian speech corpora, enabling systematic evaluation across diverse data sources. To emulate typical dissemination and channel effects in a controlled and reproducible manner, RuASD includes configurable simulations of platform and transmission distortions, including room reverberation, additive noise/music, and a range of speech-codec transcodings implemented via a unified processing chain. We benchmark a diverse set of publicly available anti-spoofing countermeasures spanning lightweight supervised architectures, graph-attention models, SSL-based detectors, and large-scale pretrained systems, and report reference results on both clean and simulated conditions to characterize robustness under realistic perturbation pipelines. The dataset is publickly available at Hugging Face (https://huggingface.co/datasets/MTUCI/RuASD) and ModelScope (https://modelscope.cn/datasets/lab260/RuASD).
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces the Russian AntiSpoofing Dataset (RuASD), a benchmark designed to evaluate the generalization and robustness of anti-spoofing systems for the Russian language. The dataset addresses the linguistic gap in existing benchmarks by incorporating 37 different Russian-capable TTS and VC systems. It also includes a standardized processing chain to simulate realistic channel distortions like reverberation, noise, and codec compression. The authors benchmark several modern architectures (AASIST, RawNet2, XLS-R, etc.) across clean and distorted conditions, providing a baseline for future research in Russian speech security.
Significance. RuASD represents a significant contribution to the field of speech anti-spoofing, particularly for non-English languages. The inclusion of 37 distinct synthesis systems provides high diversity, which is critical for testing generalization beyond specific generator artifacts. The paper's strength lies in its commitment to reproducibility, providing both the dataset and the augmentation pipeline via public repositories (Hugging Face/ModelScope). By formalizing a 'Robustness' evaluation set with controlled perturbations, the authors move the field toward more realistic performance metrics than standard 'clean' evaluations.
major comments (2)
- [§3.1, §3.2, and Table 2] There is a significant risk of 'channel leakage' or shortcut learning. The bona fide data is sourced from three heterogeneous corpora (ROSC, Golos, SberDevices) recorded via microphones in various environments, whereas the spoofed data is synthesized using neural TTS/VC systems, which are inherently 'clean' at the point of generation. In Table 2, the extremely low EER for XLS-R (0.82% on the 'Clean' set) may reflect the model's ability to distinguish the background acoustic signature (microphone noise, room tone) of the original bona fide corpora from the silence/digital perfection of the spoofed systems, rather than detecting synthesis artifacts. The authors should provide an analysis (e.g., an evaluation on silent segments or a comparison of noise floors) to ensure the benchmark is measuring anti-spoofing rather than channel detection; a minimal sketch of such a noise-floor probe follows these comments.
- [§4] The partitioning strategy requires more detail regarding the 'diverse' synthesis systems. While the paper states that speakers are disjoint across sets, it is unclear if specific TTS/VC systems are also disjoint. To evaluate true generalization to 'unseen' attacks, the Evaluation set should ideally contain output from synthesis architectures not present in the Training/Dev sets. If there is system-overlap, the reported EERs characterize 'in-domain' performance rather than 'generalization' as claimed in the title.
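To make the first major comment concrete, here is a minimal sketch of the suggested noise-floor probe: estimate each file's noise floor from its quietest frames and compare the class-level distributions. The file lists, frame sizes, and the quietest-5%-of-frames heuristic for locating near-silent regions are illustrative assumptions, not details from the paper; the same frames could also feed the LTAS comparison mentioned in the rebuttal below.

```python
# Illustrative sketch only; paths and thresholds are placeholder assumptions.
import numpy as np
import soundfile as sf

def noise_floor_db(path, frame_len=2048, hop=512, quantile=0.05):
    """Estimate a file's noise floor (in dBFS) from its quietest frames."""
    audio, _ = sf.read(path, dtype="float32")
    if audio.ndim > 1:
        audio = audio.mean(axis=1)          # collapse to mono
    n_frames = max(1, (len(audio) - frame_len) // hop + 1)
    rms = np.array([
        np.sqrt(np.mean(audio[i * hop : i * hop + frame_len] ** 2))
        for i in range(n_frames)
    ])
    floor = np.quantile(rms, quantile)      # quietest ~5% of frames = "silence"
    return 20.0 * np.log10(floor + 1e-12)

def summarize(paths):
    floors = np.array([noise_floor_db(p) for p in paths])
    return floors.mean(), floors.std()

# Placeholder file lists; substitute the actual RuASD partitions.
bona_fide_paths = ["bona_fide/0001.wav", "bona_fide/0002.wav"]
spoof_paths = ["spoof/0001.wav", "spoof/0002.wav"]

bona_mu, bona_sd = summarize(bona_fide_paths)
spoof_mu, spoof_sd = summarize(spoof_paths)
# A large, consistent gap here would indicate the shortcut the comment warns about.
print(f"bona fide noise floor: {bona_mu:6.1f} +/- {bona_sd:4.1f} dBFS")
print(f"spoof noise floor:     {spoof_mu:6.1f} +/- {spoof_sd:4.1f} dBFS")
```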
minor comments (4)
- [Abstract] Typo: 'publickly' should be 'publicly'.
- [§3.3] The augmentation probability is set to 0.5. It would be helpful to justify this choice or provide a sensitivity analysis; specifically, whether the EER increases linearly with the severity of the distortion or whether certain codecs (like G.711) have a disproportionate impact on detection. A toy version of such a perturbation chain is sketched after these comments.
- [Table 1] The table lists the number of files, but it would be beneficial to also report the total duration in hours for each split (Bona fide vs. Spoof) to give a better sense of dataset scale.
- [Figure 3] The t-SNE plot is informative, but the color contrast between some synthesis systems is difficult to distinguish. Highlighting groups (e.g., GAN-based vs. Diffusion-based) might improve readability.
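For the §3.3 comment above, the following is a toy, self-contained perturbation chain in which each stage fires independently with probability 0.5. The synthetic exponential-decay impulse response, the 5 to 20 dB SNR range, and the mu-law round-trip (a rough stand-in for G.711 companding) are our illustrative assumptions, not the RuASD pipeline itself.

```python
# Toy perturbation chain, not the paper's implementation.
import numpy as np

rng = np.random.default_rng(0)

def add_noise(x, snr_db):
    """Mix in white noise at a target signal-to-noise ratio."""
    noise = rng.standard_normal(len(x)).astype(np.float32)
    sig_p, noise_p = np.mean(x ** 2), np.mean(noise ** 2)
    noise *= np.sqrt(sig_p / (noise_p * 10 ** (snr_db / 10)))
    return x + noise

def add_reverb(x, sr=16000, rt60=0.3):
    """Convolve with a crude synthetic room impulse response
    (exponentially decaying noise reaching -60 dB at rt60 seconds)."""
    n = int(sr * rt60)
    ir = rng.standard_normal(n).astype(np.float32) * np.exp(-6.9 * np.arange(n) / n)
    y = np.convolve(x, ir)[: len(x)]
    return y / (np.max(np.abs(y)) + 1e-9)

def mulaw_roundtrip(x, mu=255):
    """8-bit mu-law companding round-trip, a rough stand-in for a G.711 pass."""
    y = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)   # compress
    q = np.round((y + 1) / 2 * mu) / mu * 2 - 1                # quantize
    return np.sign(q) * ((1 + mu) ** np.abs(q) - 1) / mu       # expand

def perturb(x, p=0.5):
    """Apply each distortion independently with probability p, as in §3.3."""
    if rng.random() < p:
        x = add_reverb(x)
    if rng.random() < p:
        x = add_noise(x, snr_db=rng.uniform(5, 20))
    if rng.random() < p:
        x = mulaw_roundtrip(np.clip(x, -1.0, 1.0))
    return x

# Example: perturb one second of a 440 Hz tone at 16 kHz.
t = np.arange(16000, dtype=np.float32) / 16000
print(perturb(0.5 * np.sin(2 * np.pi * 440 * t)).shape)
```

A sensitivity analysis of the kind the comment requests would sweep p and the per-stage severity (SNR, rt60, codec choice) and plot the resulting EER against each axis.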
Simulated Author's Rebuttal
We thank the reviewer for the constructive feedback and for recognizing the significance of RuASD as a benchmark for the Russian language. We agree that the risks of 'channel leakage' and the definition of 'generalization' are critical considerations for the validity of anti-spoofing research. We address these points below by committing to supplementary acoustic analyses and more granular reporting of results on unseen synthesis architectures.
Point-by-point responses
Referee: [§3.1, §3.2, and Table 2] There is a significant risk of 'channel leakage' or shortcut learning... The authors should provide an analysis (e.g., an evaluation on silent segments or a comparison of noise floors) to ensure the benchmark is measuring anti-spoofing rather than channel detection.
Authors: The reviewer raises a critical point regarding the acoustic mismatch between the heterogeneous bona fide corpora and the digitally synthesized spoofed audio. This 'channel shortcut' is indeed a known phenomenon in benchmarks like ASVspoof 2019. We acknowledge that the 0.82% EER for XLS-R on the 'Clean' set may partly reflect the model's sensitivity to the presence of microphone/ambient noise in the bona fide samples versus the digital perfection of the synthesizers. To address this, we will incorporate an auxiliary analysis in the revised manuscript: 1) A comparison of the Long-Term Average Spectra (LTAS) and noise floors between the subsets. 2) Results of an experiment where the models are tested solely on the silent regions of the audio. Furthermore, we will emphasize in the text that the 'Robustness' track (Section 4.3), which applies uniform noise and reverberation to both classes, is intended specifically to suppress these channel-based shortcuts and provides a more honest assessment of anti-spoofing performance.
revision: partial
Referee: [§4] The partitioning strategy requires more detail regarding the 'diverse' synthesis systems... If there is system-overlap, the reported EERs characterize 'in-domain' performance rather than 'generalization' as claimed in the title.
Authors: We agree that 'generalization' should ideally imply performance on architectures not encountered during training. In the current version of RuASD, while speakers are strictly disjoint, some synthesis systems (architectures) are present in both Training and Evaluation sets to ensure robust baseline training across diverse technologies. However, we acknowledge that this conflates in-domain robustness with cross-system generalization. In the revised manuscript, we will: 1) Update Table 1 to explicitly mark which of the 37 synthesis systems are 'seen' vs. 'unseen' during evaluation. 2) Provide a breakdown of the EER for the 'unseen' systems subset in Table 2. This will allow researchers to clearly distinguish between a model's ability to handle known synthesis artifacts and its ability to generalize to novel, unseen generators; a sketch of such a per-system breakdown follows these responses.
revision: yes
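As a companion to the second response, here is a sketch of the promised per-system EER breakdown. The score arrays, system names, and the seen/unseen split are placeholders; only the EER computation (the operating point where false-accept and false-reject rates cross) is standard.

```python
# Placeholder data; real scores would come from a detector run on RuASD splits.
import numpy as np

def compute_eer(bona_scores, spoof_scores):
    """EER: the threshold where the false-acceptance rate on spoofs equals
    the false-rejection rate on bona fide (higher score = more bona fide)."""
    thresholds = np.sort(np.concatenate([bona_scores, spoof_scores]))
    far = np.array([(spoof_scores >= t).mean() for t in thresholds])
    frr = np.array([(bona_scores < t).mean() for t in thresholds])
    idx = np.argmin(np.abs(far - frr))
    return (far[idx] + frr[idx]) / 2

def eer_by_group(bona, scores_by_system, unseen):
    """Report per-system EER, tagging each generator as seen or unseen."""
    for name, spoof in scores_by_system.items():
        tag = "unseen" if name in unseen else "seen"
        print(f"{name:>10s} ({tag:>6s}): EER = {100 * compute_eer(bona, spoof):.2f}%")

rng = np.random.default_rng(0)
bona = rng.normal(2.0, 1.0, 1000)          # placeholder bona fide scores
scores_by_system = {                       # placeholder per-generator scores
    "VITS": rng.normal(-1.0, 1.0, 500),
    "F5-TTS": rng.normal(0.5, 1.0, 500),
}
eer_by_group(bona, scores_by_system, unseen={"F5-TTS"})
```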
Circularity Check
No significant circularity identified in dataset construction or benchmarking.
full rationale
The RuASD paper follows a standard empirical methodology for machine learning dataset papers: construction of a corpus (mixing heterogeneous bona fide sources with diverse synthetic outputs), application of a perturbation pipeline (referenced to prior work by the authors), and benchmarking with several independent model architectures. The Pith circularity check finds no evidence of results being forced by definition or construction. The 'predictions' (Error Rates) are empirical outputs of models trained on standard splits, not re-mapped inputs. While the authors use their own previous work for the augmentation logic (FaS-PS) and one of the baseline models (GAT-LID), they also evaluate against widely accepted external baselines (AASIST, RawNet2, XLS-R), providing an independent check on the dataset's utility. The skeptic's concern regarding 'channel leakage'—where models might distinguish bona fide from spoofed samples based on background noise signatures rather than synthesis artifacts—is a valid critique of the benchmark's realism (correctness risk), but it does not constitute a circular derivation. The models are not guaranteed to succeed; in fact, the results show significant performance degradation on the robustness tracks, confirming that the benchmark results are not predetermined by the input definitions.
Axiom & Free-Parameter Ledger
free parameters (1)
- Augmentation Ratios
axioms (1)
- (domain assumption) Diversity of synthesis systems translates to model generalization.