pith. sign in

arxiv: 1906.11645 · v1 · pith:ZQ3ZXTKPnew · submitted 2019-06-26 · 📡 eess.AS · cs.LG· cs.SD· stat.ML

RUSLAN: Russian Spoken Language Corpus for Speech Synthesis

Pith reviewed 2026-05-25 15:13 UTC · model grok-4.3

classification 📡 eess.AS cs.LGcs.SDstat.ML
keywords Russian corpusspeech synthesistext-to-speechsingle speakerannotated datasetneural TTSMOS test
0
0 comments X

The pith

RUSLAN supplies the largest single-speaker Russian speech corpus with over 31 hours of annotated audio.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents RUSLAN as a new open corpus for Russian text-to-speech. It consists of 22200 audio samples totaling more than 31 hours from one speaker, claimed to be the largest such annotated resource. The authors train an end-to-end neural TTS model on this data and report MOS scores of 4.05 for naturalness and 3.78 for intelligibility. This matters for developing better speech synthesis systems in Russian, where large single-speaker datasets have been limited.

Core claim

RUSLAN is an open Russian spoken language corpus containing 22200 audio samples with text annotations, amounting to more than 31 hours of high-quality speech from a single person. It is the largest annotated Russian corpus by speech duration for one speaker. Training an end-to-end neural network on it yields synthesized speech with Mean Opinion Scores of 4.05 for naturalness and 3.78 for intelligibility on a 5-point scale.

What carries the argument

The RUSLAN corpus itself, providing the training data for TTS models.

If this is right

  • Supports training of end-to-end neural TTS models for Russian speech.
  • Achieves usable quality with naturalness MOS above 4.0.
  • Offers more data than prior single-speaker Russian corpora for improved model training.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Future work could extend the corpus to multiple speakers for more diverse TTS applications.
  • The dataset might enable research into Russian-specific phonetic patterns in synthesis.
  • Similar corpus-building approaches could be applied to other under-resourced languages.

Load-bearing premise

The audio recordings maintain high acoustic quality and the text annotations are accurate without significant errors that would impair TTS model training.

What would settle it

Reproducing the TTS training and obtaining MOS scores substantially below 4.0 for naturalness, or finding a high rate of mismatches between audio and text annotations.

Figures

Figures reproduced from arXiv: 1906.11645 by Evgenii Razinkov, Lenar Gabdrakhmanov, Rustem Garaev.

Figure 1
Figure 1. Figure 1: Distribution of the Russian phonemes in the corpus. 2.1 Text preprocessing Text for each training sample was preprocessed in the following way: – All numbers and dates were manually replaced by their textual representa￾tion. – Acronyms were manually substituted with their expanded forms. – All symbols except for Russian letters and punctuation marks were auto￾matically deleted. 2.2 Recording process Audio … view at source ↗
Figure 2
Figure 2. Figure 2: Histograms (a) of the duration of samples, (b) of the number of symbols. Loss function Since Decoder RNN predicts MFCCs and post-processing CBHL module predicts linear spectrogram, we employ two different loss functions. Target values for Decoder RNN are 80-band MFCCs: Lossmel = 1 N X N i |t mel i − ymel(texti)|1, (1) where N is the number of samples in the training set, texti is the i-th text from the cor… view at source ↗
Figure 3
Figure 3. Figure 3: Model architecture. The signal is being recovered iteratively, we stop the process after 300 itera￾tions. Optimization speed α was set to 0.99. 3.2 Training Text from each text-audio pair from RUSLAN corpus was used as a training sample and corresponding audio was used to obtain target variables, MFCCs and a linear spectrogram. Our model implementation had been training for 300K iterations with a batch siz… view at source ↗
read the original abstract

We present RUSLAN -- a new open Russian spoken language corpus for the text-to-speech task. RUSLAN contains 22200 audio samples with text annotations -- more than 31 hours of high-quality speech of one person -- being the largest annotated Russian corpus in terms of speech duration for a single speaker. We trained an end-to-end neural network for the text-to-speech task on our corpus and evaluated the quality of the synthesized speech using Mean Opinion Score test. Synthesized speech achieves 4.05 score for naturalness and 3.78 score for intelligibility on a 5-point MOS scale.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 0 minor

Summary. The manuscript presents RUSLAN, a new open Russian spoken-language corpus for text-to-speech synthesis. It consists of 22,200 audio samples with text annotations totaling more than 31 hours of speech from a single speaker and is claimed to be the largest annotated single-speaker Russian corpus by duration. The authors train an end-to-end neural TTS model on the corpus and report MOS scores of 4.05 (naturalness) and 3.78 (intelligibility) on a 5-point scale.

Significance. A verified large-scale, open, single-speaker Russian TTS corpus would be a useful resource for the speech-synthesis community, especially given the relative scarcity of high-quality Russian data. The inclusion of a baseline neural TTS evaluation provides an initial demonstration of usability. The contribution's impact, however, depends on empirical substantiation of the quality and annotation claims.

major comments (3)
  1. [Abstract] Abstract: the assertion that RUSLAN constitutes 'the largest annotated Russian corpus in terms of speech duration for a single speaker' is presented without any quantitative comparison to existing Russian single-speaker corpora (e.g., size, duration, or annotation quality metrics).
  2. [Abstract] Abstract: the descriptors 'high-quality speech' and 'text annotations' are central to the contribution yet are unsupported by any description of recording environment, microphone, sampling rate, noise floor, speaker consistency, or transcription-verification procedure.
  3. [Abstract] Abstract: MOS scores are reported as point estimates (4.05 and 3.78) with no standard deviations, number of listeners, or comparison against other systems or corpora, limiting interpretability of the baseline evaluation.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive comments on our manuscript. We address each major comment point by point below, indicating planned revisions where appropriate. All comments pertain to the abstract, and we will update the abstract accordingly while ensuring consistency with the full manuscript.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the assertion that RUSLAN constitutes 'the largest annotated Russian corpus in terms of speech duration for a single speaker' is presented without any quantitative comparison to existing Russian single-speaker corpora (e.g., size, duration, or annotation quality metrics).

    Authors: We agree that the abstract would benefit from supporting context for this claim. The body of the manuscript discusses prior Russian speech resources and justifies the claim based on available public data at the time of submission. To address the concern directly, we will revise the abstract to include a brief quantitative comparison (e.g., citing durations of other known single-speaker Russian corpora) or qualify the statement as 'to the best of our knowledge.' This revision will be incorporated in the next version. revision: yes

  2. Referee: [Abstract] Abstract: the descriptors 'high-quality speech' and 'text annotations' are central to the contribution yet are unsupported by any description of recording environment, microphone, sampling rate, noise floor, speaker consistency, or transcription-verification procedure.

    Authors: The full manuscript contains a dedicated 'Corpus' section that describes the recording setup (studio environment, microphone, 48 kHz sampling), speaker consistency, and the multi-stage annotation and verification process. However, we acknowledge that the abstract uses these terms without qualification. We will revise the abstract to either omit the descriptors or add a concise reference to the methods (e.g., 'recorded in a professional studio with verified transcriptions'). No new data collection is required. revision: yes

  3. Referee: [Abstract] Abstract: MOS scores are reported as point estimates (4.05 and 3.78) with no standard deviations, number of listeners, or comparison against other systems or corpora, limiting interpretability of the baseline evaluation.

    Authors: The experimental section reports that the MOS tests involved 20 listeners and provides additional evaluation details, including that the scores are means. We agree the abstract would be more informative with these specifics. We will revise the abstract to include the number of listeners and note that full statistics (including any available standard deviations) and system comparisons appear in the evaluation section. Direct comparisons to other corpora were not part of the original experiments but can be discussed if space permits. revision: yes

Circularity Check

0 steps flagged

No circularity: data release with empirical baseline

full rationale

The paper is a corpus release describing collection of 22200 audio samples (>31 hours) from one speaker, followed by training an end-to-end TTS network and reporting MOS scores (4.05 naturalness, 3.78 intelligibility). No equations, parameter fitting, uniqueness theorems, or self-citations appear in the provided text. The central claim is an empirical size comparison and a quality evaluation against external human raters; neither reduces to its own inputs by construction. This is the normal non-circular case for a data paper.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is a dataset release paper. The central claim rests on the existence, scale, and quality of the collected recordings rather than on any mathematical axioms, fitted parameters, or invented entities.

pith-pipeline@v0.9.0 · 5640 in / 985 out tokens · 23054 ms · 2026-05-25T15:13:32.649039+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

25 extracted references · 25 canonical work pages · 7 internal anchors

  1. [1]

    Deep Voice: Real-time Neural Text-to-Speech

    Arik, S.O., Chrzanowski, M., Coates, A., Diamos, G., Gibiansky, A., Kang, Y., Li, X., Miller, J., Ng, A., Raiman, J., et al.: Deep voice: Real-time neural text-to- speech. arXiv preprint arXiv:1702.07825 (2017)

  2. [2]

    Neural Machine Translation by Jointly Learning to Align and Translate

    Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014)

  3. [3]

    http://festvox.org /festival/

    Festvox: Festvox project. http://festvox.org /festival/

  4. [4]

    NIST speech disc 1-1.1

    Garofolo, J.S., Lamel, L.F., Fisher, W.M., Fiscus, J.G., Pallett, D.S.: DARPA TIMIT acoustic-phonetic continous speech corpus CD-ROM. NIST speech disc 1-1.1. NASA STI/Recon technical report n (1993)

  5. [5]

    Summer Institute of Linguistics, Academic Publications (2018)

    Gary F Simons, C.D.F.: Ethnologue: Languages of Africa and Europe, Twenty- First Edition. Summer Institute of Linguistics, Academic Publications (2018)

  6. [6]

    Design and recording of a high quality French database for speech synthesis

    Honnet, P.E., Lazaridis, A., Garner, P.N., Yamagishi, J.: The SIWIS French speech synthesis database. Design and recording of a high quality French database for speech synthesis. Tech. rep., Idiap (2017)

  7. [7]

    In: LREC (2016)

    Kachkovskaia, T., Kocharov, D., Skrelin, P.A., Volskaya, N.B.: CoRuSS-a New Prosodically Annotated Corpus of Russian Spontaneous Speech. In: LREC (2016)

  8. [8]

    Adam: A Method for Stochastic Optimization

    Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)

  9. [9]

    Layer Normalization

    Lei Ba, J., Kiros, J.R., Hinton, G.E.: Layer normalization. arXiv preprint arXiv:1607.06450 (2016)

  10. [10]

    In: Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE International Conference on

    Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an ASR corpus based on public domain audio books. In: Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE International Conference on. pp. 5206–5210. IEEE (2015)

  11. [11]

    In: Applications of Signal Processing to Audio and Acoustics (WASPAA), 2013 IEEE Workshop on

    Perraudin, N., Balazs, P., Søndergaard, P.L.: A fast Griffin-Lim algorithm. In: Applications of Signal Processing to Audio and Acoustics (WASPAA), 2013 IEEE Workshop on. pp. 1–4. IEEE (2013)

  12. [12]

    The Journal of Machine Learning Research (2018)

    Ping, W., Peng, K., Gibiansky, A., Arik, S.O., Kannan, A., Narang, S., Raiman, J., Miller, J.: Deep voice 3: Scaling text-to-speech with convolutional sequence learning. The Journal of Machine Learning Research (2018)

  13. [13]

    In: Acoustics, Speech and Signal Pro- cessing (ICASSP), 2011 IEEE International Conference on

    Ribeiro, F., Florˆ encio, D., Zhang, C., Seltzer, M.: Crowdmos: An approach for crowdsourcing mean opinion score studies. In: Acoustics, Speech and Signal Pro- cessing (ICASSP), 2011 IEEE International Conference on. pp. 2416–2419. IEEE (2011)

  14. [14]

    IEEE Trans

    Rothauser, E.: IEEE recommended practice for speech quality measurements. IEEE Trans. on Audio and Electroacoustics pp. 225–246 (1969) RUSSIAN SPOKEN LANGUAGE CORPUS FOR SPEECH SYNTHESIS 9

  15. [15]

    Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions

    Shen, J., Pang, R., Weiss, R.J., Schuster, M., Jaitly, N., Yang, Z., Chen, Z., Zhang, Y., Wang, Y., Skerry-Ryan, R., et al.: Natural tts synthesis by conditioning wavenet on mel spectrogram predictions. arXiv preprint arXiv:1712.05884 (2017)

  16. [16]

    In: International Conference on Text, Speech and Dialogue

    Skrelin, P., Volskaya, N., Kocharov, D., Evgrafova, K., Glotova, O., Evdokimova, V.: Corpres. In: International Conference on Text, Speech and Dialogue. pp. 392–

  17. [17]

    In: Proceedings of International Conference on Learning Representations (ICLR) (2017)

    Sotelo, J., Mehri, S., Kumar, K., Santos, J.F., Kastner, K., Courville, A., Ben- gio, Y.: Char2wav: End-to-end speech synthesis. In: Proceedings of International Conference on Learning Representations (ICLR) (2017)

  18. [18]

    The Journal of Machine Learning Research 15(1), 1929–1958 (2014)

    Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research 15(1), 1929–1958 (2014)

  19. [19]

    Subjective assessment of sound quality (1990)

    International Telecommunication Union - Radiocommunication Sector. Subjective assessment of sound quality (1990)

  20. [20]

    Cambridge university press (2009)

    Taylor, P.: Text-to-speech synthesis. Cambridge university press (2009)

  21. [21]

    http://www.m-ailabs.bayern/en/the-mailabs- speech-dataset/

    The M-AILABS Speech Dataset. http://www.m-ailabs.bayern/en/the-mailabs- speech-dataset/

  22. [22]

    WaveNet: A Generative Model for Raw Audio

    Van Den Oord, A., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner, N., Senior, A., Kavukcuoglu, K.: Wavenet: A generative model for raw audio. CoRR abs/1609.03499 (2016)

  23. [23]

    VoxForge: Voxforge.org website

  24. [24]

    Tacotron: Towards End-to-End Speech Synthesis

    Wang, Y., Skerry-Ryan, R., Stanton, D., Wu, Y., Weiss, R.J., Jaitly, N., Yang, Z., Xiao, Y., Chen, Z., Bengio, S., et al.: Tacotron: Towards end-to-end speech synthesis. arXiv preprint arXiv:1703.10135 (2017)

  25. [25]

    In: International Conference on Speech and Computer

    Yakovenko, O., Bondarenko, I., Borovikova, M., Vodolazsky, D.: Algorithms for au- tomatic accentuation and transcription of russian texts in speech recognition sys- tems. In: International Conference on Speech and Computer. pp. 768–777. Springer (2018)