Evaluating Generalization and Robustness in Russian Anti-Spoofing: The RuASD Initiative
Pith reviewed 2026-05-08 02:25 UTC · model gemini-3-flash-preview
The pith
A new benchmark tests whether AI can spot fake Russian speech in the real world.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors demonstrate that current anti-spoofing models struggle significantly when faced with 'unseen' synthesis techniques—those not present during the model's training—and when audio is passed through realistic communication channels like codecs or reverb. They find that while large, pre-trained models generally perform better, no single architecture is currently immune to the performance drops caused by common audio distortions. The RuASD dataset serves as a diagnostic tool to identify these specific failure points in Russian-language security applications.
What carries the argument
The RuASD (Russian Anti-Spoofing Dataset) pipeline, which integrates 37 distinct text-to-speech and voice-cloning systems with a 'perturbation chain' that simulates real-world audio degradation like room echo and digital compression.
If this is right
- Security developers can now measure exactly how much their voice-authentication systems degrade when a user calls from a noisy environment.
- The high diversity of synthesis systems sets a new standard for what generalization means in voice-spoofing detection.
- Future detection models will likely need to incorporate channel-aware training to remain effective against modern digital transmission.
- Researchers can compare the effectiveness of lightweight models versus large-scale pre-trained systems on a level playing field for the Russian language.
Where Pith is reading between the lines
- The gap between laboratory performance and real-world deployment suggests that current performance claims in voice security may be significantly overoptimistic.
- The methodology used here for Russian could be adapted to create similar stress-test benchmarks for lower-resource languages where voice-cloning threats are emerging.
- If the simulated distortions in the dataset become a standard training target, attackers may shift toward high-fidelity spoofs that exploit the detection model's over-reliance on identifying channel noise.
Load-bearing premise
The simulated distortions, such as added noise and digital compression, are assumed to be representative of the messy audio quality found in actual fraudulent phone calls.
What would settle it
If a detection model that performs perfectly on the RuASD benchmark consistently fails to catch actual fraudulent Russian voice-clones in a live banking or security environment, then the benchmark's simulated conditions are not sufficiently representative of real-world attacks.
Original abstract
RuASD (Russian AntiSpoofing Dataset) is a dedicated, reproducible benchmark for Russian-language speech anti-spoofing designed to evaluate both in-domain discrimination and robustness to deployment-style distribution shifts. It combines a large spoof subset synthesized using 37 modern Russian-capable TTS and voice-cloning systems with a bona fide subset curated from multiple heterogeneous open Russian speech corpora, enabling systematic evaluation across diverse data sources. To emulate typical dissemination and channel effects in a controlled and reproducible manner, RuASD includes configurable simulations of platform and transmission distortions, including room reverberation, additive noise/music, and a range of speech-codec transcodings implemented via a unified processing chain. We benchmark a diverse set of publicly available anti-spoofing countermeasures spanning lightweight supervised architectures, graph-attention models, SSL-based detectors, and large-scale pretrained systems, and report reference results on both clean and simulated conditions to characterize robustness under realistic perturbation pipelines. The dataset is publickly available at Hugging Face (https://huggingface.co/datasets/MTUCI/RuASD) and ModelScope (https://modelscope.cn/datasets/lab260/RuASD).
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces the Russian AntiSpoofing Dataset (RuASD), a benchmark designed to evaluate the generalization and robustness of anti-spoofing systems for the Russian language. The dataset addresses the linguistic gap in existing benchmarks by incorporating 37 different Russian-capable TTS and VC systems. It also includes a standardized processing chain to simulate realistic channel distortions like reverberation, noise, and codec compression. The authors benchmark several modern architectures (AASIST, RawNet2, XLS-R, etc.) across clean and distorted conditions, providing a baseline for future research in Russian speech security.
Significance. RuASD represents a significant contribution to the field of speech anti-spoofing, particularly for non-English languages. The inclusion of 37 distinct synthesis systems provides high diversity, which is critical for testing generalization beyond specific generator artifacts. The paper's strength lies in its commitment to reproducibility, providing both the dataset and the augmentation pipeline via public repositories (Hugging Face/ModelScope). By formalizing a 'Robustness' evaluation set with controlled perturbations, the authors move the field toward more realistic performance metrics than standard 'clean' evaluations.
major comments (2)
- [§3.1, §3.2, and Table 2] There is a significant risk of 'channel leakage' or shortcut learning. The bona fide data is sourced from three heterogeneous corpora (ROSC, Golos, SberDevices) recorded via microphones in various environments, whereas the spoofed data is synthesized using neural TTS/VC systems, which are inherently 'clean' at the point of generation. In Table 2, the extremely low EER for XLS-R (0.82% on the 'Clean' set) may reflect the model's ability to distinguish the background acoustic signature (microphone noise, room tone) of the original bona fide corpora from the silence/digital perfection of the spoofed systems, rather than detecting synthesis artifacts. The authors should provide an analysis (e.g., an evaluation on silent segments or a comparison of noise floors) to ensure the benchmark is measuring anti-spoofing rather than channel detection; a minimal sketch of such a noise-floor probe follows these comments.
- [§4] The partitioning strategy requires more detail regarding the 'diverse' synthesis systems. While the paper states that speakers are disjoint across sets, it is unclear if specific TTS/VC systems are also disjoint. To evaluate true generalization to 'unseen' attacks, the Evaluation set should ideally contain output from synthesis architectures not present in the Training/Dev sets. If there is system-overlap, the reported EERs characterize 'in-domain' performance rather than 'generalization' as claimed in the title.
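To make the first major comment concrete, here is a minimal sketch of the suggested noise-floor probe: estimate each file's noise floor from its quietest frames and compare the class-level distributions. The file lists, frame sizes, and the quietest-5%-of-frames heuristic for locating near-silent regions are illustrative assumptions, not details from the paper; the same frames could also feed the LTAS comparison mentioned in the rebuttal below.

```python
# Illustrative sketch only; paths and thresholds are placeholder assumptions.
import numpy as np
import soundfile as sf

def noise_floor_db(path, frame_len=2048, hop=512, quantile=0.05):
    """Estimate a file's noise floor (in dBFS) from its quietest frames."""
    audio, _ = sf.read(path, dtype="float32")
    if audio.ndim > 1:
        audio = audio.mean(axis=1)          # collapse to mono
    n_frames = max(1, (len(audio) - frame_len) // hop + 1)
    rms = np.array([
        np.sqrt(np.mean(audio[i * hop : i * hop + frame_len] ** 2))
        for i in range(n_frames)
    ])
    floor = np.quantile(rms, quantile)      # quietest ~5% of frames = "silence"
    return 20.0 * np.log10(floor + 1e-12)

def summarize(paths):
    floors = np.array([noise_floor_db(p) for p in paths])
    return floors.mean(), floors.std()

# Placeholder file lists; substitute the actual RuASD partitions.
bona_fide_paths = ["bona_fide/0001.wav", "bona_fide/0002.wav"]
spoof_paths = ["spoof/0001.wav", "spoof/0002.wav"]

bona_mu, bona_sd = summarize(bona_fide_paths)
spoof_mu, spoof_sd = summarize(spoof_paths)
# A large, consistent gap here would indicate the shortcut the comment warns about.
print(f"bona fide noise floor: {bona_mu:6.1f} +/- {bona_sd:4.1f} dBFS")
print(f"spoof noise floor:     {spoof_mu:6.1f} +/- {spoof_sd:4.1f} dBFS")
```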
minor comments (4)
- [Abstract] Typo: 'publickly' should be 'publicly'.
- [§3.3] The augmentation probability is set to 0.5. It would be helpful to justify this choice or provide a sensitivity analysis; specifically, whether the EER increases linearly with the severity of the distortion or whether certain codecs (like G.711) have a disproportionate impact on detection. A toy version of such a perturbation chain is sketched after these comments.
- [Table 1] The table lists the number of files, but it would be beneficial to also report the total duration in hours for each split (Bona fide vs. Spoof) to give a better sense of dataset scale.
- [Figure 3] The t-SNE plot is informative, but the color contrast between some synthesis systems is difficult to distinguish. Highlighting groups (e.g., GAN-based vs. Diffusion-based) might improve readability.
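For the §3.3 comment above, the following is a toy, self-contained perturbation chain in which each stage fires independently with probability 0.5. The synthetic exponential-decay impulse response, the 5 to 20 dB SNR range, and the mu-law round-trip (a rough stand-in for G.711 companding) are our illustrative assumptions, not the RuASD pipeline itself.

```python
# Toy perturbation chain, not the paper's implementation.
import numpy as np

rng = np.random.default_rng(0)

def add_noise(x, snr_db):
    """Mix in white noise at a target signal-to-noise ratio."""
    noise = rng.standard_normal(len(x)).astype(np.float32)
    sig_p, noise_p = np.mean(x ** 2), np.mean(noise ** 2)
    noise *= np.sqrt(sig_p / (noise_p * 10 ** (snr_db / 10)))
    return x + noise

def add_reverb(x, sr=16000, rt60=0.3):
    """Convolve with a crude synthetic room impulse response
    (exponentially decaying noise reaching -60 dB at rt60 seconds)."""
    n = int(sr * rt60)
    ir = rng.standard_normal(n).astype(np.float32) * np.exp(-6.9 * np.arange(n) / n)
    y = np.convolve(x, ir)[: len(x)]
    return y / (np.max(np.abs(y)) + 1e-9)

def mulaw_roundtrip(x, mu=255):
    """8-bit mu-law companding round-trip, a rough stand-in for a G.711 pass."""
    y = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)   # compress
    q = np.round((y + 1) / 2 * mu) / mu * 2 - 1                # quantize
    return np.sign(q) * ((1 + mu) ** np.abs(q) - 1) / mu       # expand

def perturb(x, p=0.5):
    """Apply each distortion independently with probability p, as in §3.3."""
    if rng.random() < p:
        x = add_reverb(x)
    if rng.random() < p:
        x = add_noise(x, snr_db=rng.uniform(5, 20))
    if rng.random() < p:
        x = mulaw_roundtrip(np.clip(x, -1.0, 1.0))
    return x

# Example: perturb one second of a 440 Hz tone at 16 kHz.
t = np.arange(16000, dtype=np.float32) / 16000
print(perturb(0.5 * np.sin(2 * np.pi * 440 * t)).shape)
```

A sensitivity analysis of the kind the comment requests would sweep p and the per-stage severity (SNR, rt60, codec choice) and plot the resulting EER against each axis.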
Simulated Author's Rebuttal
We thank the reviewer for the constructive feedback and for recognizing the significance of RuASD as a benchmark for the Russian language. We agree that the risks of 'channel leakage' and the definition of 'generalization' are critical considerations for the validity of anti-spoofing research. We address these points below by committing to supplementary acoustic analyses and more granular reporting of results on unseen synthesis architectures.
Point-by-point responses
Referee: [§3.1, §3.2, and Table 2] There is a significant risk of 'channel leakage' or shortcut learning... The authors should provide an analysis (e.g., an evaluation on silent segments or a comparison of noise floors) to ensure the benchmark is measuring anti-spoofing rather than channel detection.
Authors: The reviewer raises a critical point regarding the acoustic mismatch between the heterogeneous bona fide corpora and the digitally synthesized spoofed audio. This 'channel shortcut' is indeed a known phenomenon in benchmarks like ASVspoof 2019. We acknowledge that the 0.82% EER for XLS-R on the 'Clean' set may partly reflect the model's sensitivity to the presence of microphone/ambient noise in the bona fide samples versus the digital perfection of the synthesizers. To address this, we will incorporate an auxiliary analysis in the revised manuscript: 1) A comparison of the Long-Term Average Spectra (LTAS) and noise floors between the subsets. 2) Results of an experiment where the models are tested solely on the silent regions of the audio. Furthermore, we will emphasize in the text that the 'Robustness' track (Section 4.3), which applies uniform noise and reverberation to both classes, is intended specifically to suppress these channel-based shortcuts and provides a more honest assessment of anti-spoofing performance.
revision: partial
Referee: [§4] The partitioning strategy requires more detail regarding the 'diverse' synthesis systems... If there is system-overlap, the reported EERs characterize 'in-domain' performance rather than 'generalization' as claimed in the title.
Authors: We agree that 'generalization' should ideally imply performance on architectures not encountered during training. In the current version of RuASD, while speakers are strictly disjoint, some synthesis systems (architectures) are present in both Training and Evaluation sets to ensure robust baseline training across diverse technologies. However, we acknowledge that this conflates in-domain robustness with cross-system generalization. In the revised manuscript, we will: 1) Update Table 1 to explicitly mark which of the 37 synthesis systems are 'seen' vs. 'unseen' during evaluation. 2) Provide a breakdown of the EER for the 'unseen' systems subset in Table 2. This will allow researchers to clearly distinguish between a model's ability to handle known synthesis artifacts and its ability to generalize to novel, unseen generators; a sketch of such a per-system breakdown follows these responses.
revision: yes
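As a companion to the second response, here is a sketch of the promised per-system EER breakdown. The score arrays, system names, and the seen/unseen split are placeholders; only the EER computation (the operating point where false-accept and false-reject rates cross) is standard.

```python
# Placeholder data; real scores would come from a detector run on RuASD splits.
import numpy as np

def compute_eer(bona_scores, spoof_scores):
    """EER: the threshold where the false-acceptance rate on spoofs equals
    the false-rejection rate on bona fide (higher score = more bona fide)."""
    thresholds = np.sort(np.concatenate([bona_scores, spoof_scores]))
    far = np.array([(spoof_scores >= t).mean() for t in thresholds])
    frr = np.array([(bona_scores < t).mean() for t in thresholds])
    idx = np.argmin(np.abs(far - frr))
    return (far[idx] + frr[idx]) / 2

def eer_by_group(bona, scores_by_system, unseen):
    """Report per-system EER, tagging each generator as seen or unseen."""
    for name, spoof in scores_by_system.items():
        tag = "unseen" if name in unseen else "seen"
        print(f"{name:>10s} ({tag:>6s}): EER = {100 * compute_eer(bona, spoof):.2f}%")

rng = np.random.default_rng(0)
bona = rng.normal(2.0, 1.0, 1000)          # placeholder bona fide scores
scores_by_system = {                       # placeholder per-generator scores
    "VITS": rng.normal(-1.0, 1.0, 500),
    "F5-TTS": rng.normal(0.5, 1.0, 500),
}
eer_by_group(bona, scores_by_system, unseen={"F5-TTS"})
```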
Circularity Check
No significant circularity identified in dataset construction or benchmarking.
full rationale
The RuASD paper follows a standard empirical methodology for machine learning dataset papers: construction of a corpus (mixing heterogeneous bona fide sources with diverse synthetic outputs), application of a perturbation pipeline (referenced to prior work by the authors), and benchmarking with several independent model architectures. The Pith circularity check finds no evidence of results being forced by definition or construction. The 'predictions' (Error Rates) are empirical outputs of models trained on standard splits, not re-mapped inputs. While the authors use their own previous work for the augmentation logic (FaS-PS) and one of the baseline models (GAT-LID), they also evaluate against widely accepted external baselines (AASIST, RawNet2, XLS-R), providing an independent check on the dataset's utility. The skeptic's concern regarding 'channel leakage'—where models might distinguish bona fide from spoofed samples based on background noise signatures rather than synthesis artifacts—is a valid critique of the benchmark's realism (correctness risk), but it does not constitute a circular derivation. The models are not guaranteed to succeed; in fact, the results show significant performance degradation on the robustness tracks, confirming that the benchmark results are not predetermined by the input definitions.
Axiom & Free-Parameter Ledger
free parameters (1)
- Augmentation Ratios
axioms (1)
- (domain assumption) Diversity of synthesis systems translates to model generalization.