RUSLAN: Russian Spoken Language Corpus for Speech Synthesis
Pith reviewed 2026-05-25 15:13 UTC · model grok-4.3
The pith
RUSLAN supplies the largest single-speaker Russian speech corpus with over 31 hours of annotated audio.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
RUSLAN is an open Russian spoken language corpus containing 22200 audio samples with text annotations, amounting to more than 31 hours of high-quality speech from a single person. It is the largest annotated Russian corpus by speech duration for one speaker. Training an end-to-end neural network on it yields synthesized speech with Mean Opinion Scores of 4.05 for naturalness and 3.78 for intelligibility on a 5-point scale.
What carries the argument
The RUSLAN corpus itself, providing the training data for TTS models.
If this is right
- Supports training of end-to-end neural TTS models for Russian speech.
- Achieves usable quality with naturalness MOS above 4.0.
- Offers more data than prior single-speaker Russian corpora for improved model training.
Where Pith is reading between the lines
- Future work could extend the corpus to multiple speakers for more diverse TTS applications.
- The dataset might enable research into Russian-specific phonetic patterns in synthesis.
- Similar corpus-building approaches could be applied to other under-resourced languages.
Load-bearing premise
The audio recordings maintain high acoustic quality and the text annotations are accurate without significant errors that would impair TTS model training.
What would settle it
Reproducing the TTS training and obtaining MOS scores substantially below 4.0 for naturalness, or finding a high rate of mismatches between audio and text annotations.
Figures
read the original abstract
We present RUSLAN -- a new open Russian spoken language corpus for the text-to-speech task. RUSLAN contains 22200 audio samples with text annotations -- more than 31 hours of high-quality speech of one person -- being the largest annotated Russian corpus in terms of speech duration for a single speaker. We trained an end-to-end neural network for the text-to-speech task on our corpus and evaluated the quality of the synthesized speech using Mean Opinion Score test. Synthesized speech achieves 4.05 score for naturalness and 3.78 score for intelligibility on a 5-point MOS scale.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents RUSLAN, a new open Russian spoken-language corpus for text-to-speech synthesis. It consists of 22,200 audio samples with text annotations totaling more than 31 hours of speech from a single speaker and is claimed to be the largest annotated single-speaker Russian corpus by duration. The authors train an end-to-end neural TTS model on the corpus and report MOS scores of 4.05 (naturalness) and 3.78 (intelligibility) on a 5-point scale.
Significance. A verified large-scale, open, single-speaker Russian TTS corpus would be a useful resource for the speech-synthesis community, especially given the relative scarcity of high-quality Russian data. The inclusion of a baseline neural TTS evaluation provides an initial demonstration of usability. The contribution's impact, however, depends on empirical substantiation of the quality and annotation claims.
major comments (3)
- [Abstract] Abstract: the assertion that RUSLAN constitutes 'the largest annotated Russian corpus in terms of speech duration for a single speaker' is presented without any quantitative comparison to existing Russian single-speaker corpora (e.g., size, duration, or annotation quality metrics).
- [Abstract] Abstract: the descriptors 'high-quality speech' and 'text annotations' are central to the contribution yet are unsupported by any description of recording environment, microphone, sampling rate, noise floor, speaker consistency, or transcription-verification procedure.
- [Abstract] Abstract: MOS scores are reported as point estimates (4.05 and 3.78) with no standard deviations, number of listeners, or comparison against other systems or corpora, limiting interpretability of the baseline evaluation.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive comments on our manuscript. We address each major comment point by point below, indicating planned revisions where appropriate. All comments pertain to the abstract, and we will update the abstract accordingly while ensuring consistency with the full manuscript.
read point-by-point responses
-
Referee: [Abstract] Abstract: the assertion that RUSLAN constitutes 'the largest annotated Russian corpus in terms of speech duration for a single speaker' is presented without any quantitative comparison to existing Russian single-speaker corpora (e.g., size, duration, or annotation quality metrics).
Authors: We agree that the abstract would benefit from supporting context for this claim. The body of the manuscript discusses prior Russian speech resources and justifies the claim based on available public data at the time of submission. To address the concern directly, we will revise the abstract to include a brief quantitative comparison (e.g., citing durations of other known single-speaker Russian corpora) or qualify the statement as 'to the best of our knowledge.' This revision will be incorporated in the next version. revision: yes
-
Referee: [Abstract] Abstract: the descriptors 'high-quality speech' and 'text annotations' are central to the contribution yet are unsupported by any description of recording environment, microphone, sampling rate, noise floor, speaker consistency, or transcription-verification procedure.
Authors: The full manuscript contains a dedicated 'Corpus' section that describes the recording setup (studio environment, microphone, 48 kHz sampling), speaker consistency, and the multi-stage annotation and verification process. However, we acknowledge that the abstract uses these terms without qualification. We will revise the abstract to either omit the descriptors or add a concise reference to the methods (e.g., 'recorded in a professional studio with verified transcriptions'). No new data collection is required. revision: yes
-
Referee: [Abstract] Abstract: MOS scores are reported as point estimates (4.05 and 3.78) with no standard deviations, number of listeners, or comparison against other systems or corpora, limiting interpretability of the baseline evaluation.
Authors: The experimental section reports that the MOS tests involved 20 listeners and provides additional evaluation details, including that the scores are means. We agree the abstract would be more informative with these specifics. We will revise the abstract to include the number of listeners and note that full statistics (including any available standard deviations) and system comparisons appear in the evaluation section. Direct comparisons to other corpora were not part of the original experiments but can be discussed if space permits. revision: yes
Circularity Check
No circularity: data release with empirical baseline
full rationale
The paper is a corpus release describing collection of 22200 audio samples (>31 hours) from one speaker, followed by training an end-to-end TTS network and reporting MOS scores (4.05 naturalness, 3.78 intelligibility). No equations, parameter fitting, uniqueness theorems, or self-citations appear in the provided text. The central claim is an empirical size comparison and a quality evaluation against external human raters; neither reduces to its own inputs by construction. This is the normal non-circular case for a data paper.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
Deep Voice: Real-time Neural Text-to-Speech
Arik, S.O., Chrzanowski, M., Coates, A., Diamos, G., Gibiansky, A., Kang, Y., Li, X., Miller, J., Ng, A., Raiman, J., et al.: Deep voice: Real-time neural text-to- speech. arXiv preprint arXiv:1702.07825 (2017)
work page internal anchor Pith review Pith/arXiv arXiv 2017
-
[2]
Neural Machine Translation by Jointly Learning to Align and Translate
Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014)
work page internal anchor Pith review Pith/arXiv arXiv 2014
- [3]
-
[4]
Garofolo, J.S., Lamel, L.F., Fisher, W.M., Fiscus, J.G., Pallett, D.S.: DARPA TIMIT acoustic-phonetic continous speech corpus CD-ROM. NIST speech disc 1-1.1. NASA STI/Recon technical report n (1993)
work page 1993
-
[5]
Summer Institute of Linguistics, Academic Publications (2018)
Gary F Simons, C.D.F.: Ethnologue: Languages of Africa and Europe, Twenty- First Edition. Summer Institute of Linguistics, Academic Publications (2018)
work page 2018
-
[6]
Design and recording of a high quality French database for speech synthesis
Honnet, P.E., Lazaridis, A., Garner, P.N., Yamagishi, J.: The SIWIS French speech synthesis database. Design and recording of a high quality French database for speech synthesis. Tech. rep., Idiap (2017)
work page 2017
-
[7]
Kachkovskaia, T., Kocharov, D., Skrelin, P.A., Volskaya, N.B.: CoRuSS-a New Prosodically Annotated Corpus of Russian Spontaneous Speech. In: LREC (2016)
work page 2016
-
[8]
Adam: A Method for Stochastic Optimization
Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
work page internal anchor Pith review Pith/arXiv arXiv 2014
-
[9]
Lei Ba, J., Kiros, J.R., Hinton, G.E.: Layer normalization. arXiv preprint arXiv:1607.06450 (2016)
work page internal anchor Pith review Pith/arXiv arXiv 2016
-
[10]
In: Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE International Conference on
Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an ASR corpus based on public domain audio books. In: Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE International Conference on. pp. 5206–5210. IEEE (2015)
work page 2015
-
[11]
In: Applications of Signal Processing to Audio and Acoustics (WASPAA), 2013 IEEE Workshop on
Perraudin, N., Balazs, P., Søndergaard, P.L.: A fast Griffin-Lim algorithm. In: Applications of Signal Processing to Audio and Acoustics (WASPAA), 2013 IEEE Workshop on. pp. 1–4. IEEE (2013)
work page 2013
-
[12]
The Journal of Machine Learning Research (2018)
Ping, W., Peng, K., Gibiansky, A., Arik, S.O., Kannan, A., Narang, S., Raiman, J., Miller, J.: Deep voice 3: Scaling text-to-speech with convolutional sequence learning. The Journal of Machine Learning Research (2018)
work page 2018
-
[13]
In: Acoustics, Speech and Signal Pro- cessing (ICASSP), 2011 IEEE International Conference on
Ribeiro, F., Florˆ encio, D., Zhang, C., Seltzer, M.: Crowdmos: An approach for crowdsourcing mean opinion score studies. In: Acoustics, Speech and Signal Pro- cessing (ICASSP), 2011 IEEE International Conference on. pp. 2416–2419. IEEE (2011)
work page 2011
-
[14]
Rothauser, E.: IEEE recommended practice for speech quality measurements. IEEE Trans. on Audio and Electroacoustics pp. 225–246 (1969) RUSSIAN SPOKEN LANGUAGE CORPUS FOR SPEECH SYNTHESIS 9
work page 1969
-
[15]
Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions
Shen, J., Pang, R., Weiss, R.J., Schuster, M., Jaitly, N., Yang, Z., Chen, Z., Zhang, Y., Wang, Y., Skerry-Ryan, R., et al.: Natural tts synthesis by conditioning wavenet on mel spectrogram predictions. arXiv preprint arXiv:1712.05884 (2017)
work page internal anchor Pith review Pith/arXiv arXiv 2017
-
[16]
In: International Conference on Text, Speech and Dialogue
Skrelin, P., Volskaya, N., Kocharov, D., Evgrafova, K., Glotova, O., Evdokimova, V.: Corpres. In: International Conference on Text, Speech and Dialogue. pp. 392–
-
[17]
In: Proceedings of International Conference on Learning Representations (ICLR) (2017)
Sotelo, J., Mehri, S., Kumar, K., Santos, J.F., Kastner, K., Courville, A., Ben- gio, Y.: Char2wav: End-to-end speech synthesis. In: Proceedings of International Conference on Learning Representations (ICLR) (2017)
work page 2017
-
[18]
The Journal of Machine Learning Research 15(1), 1929–1958 (2014)
Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research 15(1), 1929–1958 (2014)
work page 1929
-
[19]
Subjective assessment of sound quality (1990)
International Telecommunication Union - Radiocommunication Sector. Subjective assessment of sound quality (1990)
work page 1990
-
[20]
Cambridge university press (2009)
Taylor, P.: Text-to-speech synthesis. Cambridge university press (2009)
work page 2009
-
[21]
http://www.m-ailabs.bayern/en/the-mailabs- speech-dataset/
The M-AILABS Speech Dataset. http://www.m-ailabs.bayern/en/the-mailabs- speech-dataset/
-
[22]
WaveNet: A Generative Model for Raw Audio
Van Den Oord, A., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner, N., Senior, A., Kavukcuoglu, K.: Wavenet: A generative model for raw audio. CoRR abs/1609.03499 (2016)
work page internal anchor Pith review Pith/arXiv arXiv 2016
-
[23]
VoxForge: Voxforge.org website
-
[24]
Tacotron: Towards End-to-End Speech Synthesis
Wang, Y., Skerry-Ryan, R., Stanton, D., Wu, Y., Weiss, R.J., Jaitly, N., Yang, Z., Xiao, Y., Chen, Z., Bengio, S., et al.: Tacotron: Towards end-to-end speech synthesis. arXiv preprint arXiv:1703.10135 (2017)
work page internal anchor Pith review Pith/arXiv arXiv 2017
-
[25]
In: International Conference on Speech and Computer
Yakovenko, O., Bondarenko, I., Borovikova, M., Vodolazsky, D.: Algorithms for au- tomatic accentuation and transcription of russian texts in speech recognition sys- tems. In: International Conference on Speech and Computer. pp. 768–777. Springer (2018)
work page 2018
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.