RUSLAN: Russian Spoken Language Corpus for Speech Synthesis

Evgenii Razinkov; Lenar Gabdrakhmanov; Rustem Garaev

arxiv: 1906.11645 · v1 · pith:ZQ3ZXTKPnew · submitted 2019-06-26 · 📡 eess.AS · cs.LG· cs.SD· stat.ML

RUSLAN: Russian Spoken Language Corpus for Speech Synthesis

Lenar Gabdrakhmanov , Rustem Garaev , Evgenii Razinkov This is my paper

Pith reviewed 2026-05-25 15:13 UTC · model grok-4.3

classification 📡 eess.AS cs.LGcs.SDstat.ML

keywords Russian corpusspeech synthesistext-to-speechsingle speakerannotated datasetneural TTSMOS test

0 comments

The pith

RUSLAN supplies the largest single-speaker Russian speech corpus with over 31 hours of annotated audio.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents RUSLAN as a new open corpus for Russian text-to-speech. It consists of 22200 audio samples totaling more than 31 hours from one speaker, claimed to be the largest such annotated resource. The authors train an end-to-end neural TTS model on this data and report MOS scores of 4.05 for naturalness and 3.78 for intelligibility. This matters for developing better speech synthesis systems in Russian, where large single-speaker datasets have been limited.

Core claim

RUSLAN is an open Russian spoken language corpus containing 22200 audio samples with text annotations, amounting to more than 31 hours of high-quality speech from a single person. It is the largest annotated Russian corpus by speech duration for one speaker. Training an end-to-end neural network on it yields synthesized speech with Mean Opinion Scores of 4.05 for naturalness and 3.78 for intelligibility on a 5-point scale.

What carries the argument

The RUSLAN corpus itself, providing the training data for TTS models.

If this is right

Supports training of end-to-end neural TTS models for Russian speech.
Achieves usable quality with naturalness MOS above 4.0.
Offers more data than prior single-speaker Russian corpora for improved model training.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Future work could extend the corpus to multiple speakers for more diverse TTS applications.
The dataset might enable research into Russian-specific phonetic patterns in synthesis.
Similar corpus-building approaches could be applied to other under-resourced languages.

Load-bearing premise

The audio recordings maintain high acoustic quality and the text annotations are accurate without significant errors that would impair TTS model training.

What would settle it

Reproducing the TTS training and obtaining MOS scores substantially below 4.0 for naturalness, or finding a high rate of mismatches between audio and text annotations.

Figures

Figures reproduced from arXiv: 1906.11645 by Evgenii Razinkov, Lenar Gabdrakhmanov, Rustem Garaev.

**Figure 1.** Figure 1: Distribution of the Russian phonemes in the corpus. 2.1 Text preprocessing Text for each training sample was preprocessed in the following way: – All numbers and dates were manually replaced by their textual representation. – Acronyms were manually substituted with their expanded forms. – All symbols except for Russian letters and punctuation marks were automatically deleted. 2.2 Recording process Audio … view at source ↗

**Figure 2.** Figure 2: Histograms (a) of the duration of samples, (b) of the number of symbols. Loss function Since Decoder RNN predicts MFCCs and post-processing CBHL module predicts linear spectrogram, we employ two different loss functions. Target values for Decoder RNN are 80-band MFCCs: Lossmel = 1 N X N i |t mel i − ymel(texti)|1, (1) where N is the number of samples in the training set, texti is the i-th text from the cor… view at source ↗

**Figure 3.** Figure 3: Model architecture. The signal is being recovered iteratively, we stop the process after 300 iterations. Optimization speed α was set to 0.99. 3.2 Training Text from each text-audio pair from RUSLAN corpus was used as a training sample and corresponding audio was used to obtain target variables, MFCCs and a linear spectrogram. Our model implementation had been training for 300K iterations with a batch siz… view at source ↗

read the original abstract

We present RUSLAN -- a new open Russian spoken language corpus for the text-to-speech task. RUSLAN contains 22200 audio samples with text annotations -- more than 31 hours of high-quality speech of one person -- being the largest annotated Russian corpus in terms of speech duration for a single speaker. We trained an end-to-end neural network for the text-to-speech task on our corpus and evaluated the quality of the synthesized speech using Mean Opinion Score test. Synthesized speech achieves 4.05 score for naturalness and 3.78 score for intelligibility on a 5-point MOS scale.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

RUSLAN releases a new 31-hour single-speaker Russian speech corpus for TTS that fills a real gap, but the abstract gives almost no evidence on recording quality or annotation accuracy.

read the letter

The paper's main point is the release of RUSLAN, an open corpus with 22200 samples totaling over 31 hours from one speaker, positioned as the largest annotated single-speaker Russian dataset for speech synthesis. They also train a basic end-to-end TTS model and report MOS scores of 4.05 for naturalness and 3.78 for intelligibility. That data release is the actual new thing here, and it is the kind of resource that can help people working on Russian TTS without having to start from scratch. Open release plus a working baseline is straightforward and useful for the subfield. The soft spots are exactly where the stress-test note flags them. The abstract states the size and quality claims but supplies no information on microphone, room acoustics, sampling rate, noise levels, speaker consistency, or how the text annotations were verified against the audio. Without those details the 'high-quality' and 'annotated' parts rest on assertion rather than evidence, and the size comparison to prior work is not shown. The MOS numbers are presented without error bars, listener count, or test protocol. If the full manuscript adds proper sections on collection procedure and evaluation setup, those concerns shrink; if not, the utility claims stay hard to assess. This is a standard corpus paper aimed at TTS researchers who need Russian data or are building multilingual systems. Readers who actually plan to download and use the corpus will get the most value, provided the quality holds up on inspection. It is coherent on its own terms and deserves a serious referee to check the data documentation and any added comparisons, even if the technical novelty is modest.

Referee Report

3 major / 0 minor

Summary. The manuscript presents RUSLAN, a new open Russian spoken-language corpus for text-to-speech synthesis. It consists of 22,200 audio samples with text annotations totaling more than 31 hours of speech from a single speaker and is claimed to be the largest annotated single-speaker Russian corpus by duration. The authors train an end-to-end neural TTS model on the corpus and report MOS scores of 4.05 (naturalness) and 3.78 (intelligibility) on a 5-point scale.

Significance. A verified large-scale, open, single-speaker Russian TTS corpus would be a useful resource for the speech-synthesis community, especially given the relative scarcity of high-quality Russian data. The inclusion of a baseline neural TTS evaluation provides an initial demonstration of usability. The contribution's impact, however, depends on empirical substantiation of the quality and annotation claims.

major comments (3)

[Abstract] Abstract: the assertion that RUSLAN constitutes 'the largest annotated Russian corpus in terms of speech duration for a single speaker' is presented without any quantitative comparison to existing Russian single-speaker corpora (e.g., size, duration, or annotation quality metrics).
[Abstract] Abstract: the descriptors 'high-quality speech' and 'text annotations' are central to the contribution yet are unsupported by any description of recording environment, microphone, sampling rate, noise floor, speaker consistency, or transcription-verification procedure.
[Abstract] Abstract: MOS scores are reported as point estimates (4.05 and 3.78) with no standard deviations, number of listeners, or comparison against other systems or corpora, limiting interpretability of the baseline evaluation.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive comments on our manuscript. We address each major comment point by point below, indicating planned revisions where appropriate. All comments pertain to the abstract, and we will update the abstract accordingly while ensuring consistency with the full manuscript.

read point-by-point responses

Referee: [Abstract] Abstract: the assertion that RUSLAN constitutes 'the largest annotated Russian corpus in terms of speech duration for a single speaker' is presented without any quantitative comparison to existing Russian single-speaker corpora (e.g., size, duration, or annotation quality metrics).

Authors: We agree that the abstract would benefit from supporting context for this claim. The body of the manuscript discusses prior Russian speech resources and justifies the claim based on available public data at the time of submission. To address the concern directly, we will revise the abstract to include a brief quantitative comparison (e.g., citing durations of other known single-speaker Russian corpora) or qualify the statement as 'to the best of our knowledge.' This revision will be incorporated in the next version. revision: yes
Referee: [Abstract] Abstract: the descriptors 'high-quality speech' and 'text annotations' are central to the contribution yet are unsupported by any description of recording environment, microphone, sampling rate, noise floor, speaker consistency, or transcription-verification procedure.

Authors: The full manuscript contains a dedicated 'Corpus' section that describes the recording setup (studio environment, microphone, 48 kHz sampling), speaker consistency, and the multi-stage annotation and verification process. However, we acknowledge that the abstract uses these terms without qualification. We will revise the abstract to either omit the descriptors or add a concise reference to the methods (e.g., 'recorded in a professional studio with verified transcriptions'). No new data collection is required. revision: yes
Referee: [Abstract] Abstract: MOS scores are reported as point estimates (4.05 and 3.78) with no standard deviations, number of listeners, or comparison against other systems or corpora, limiting interpretability of the baseline evaluation.

Authors: The experimental section reports that the MOS tests involved 20 listeners and provides additional evaluation details, including that the scores are means. We agree the abstract would be more informative with these specifics. We will revise the abstract to include the number of listeners and note that full statistics (including any available standard deviations) and system comparisons appear in the evaluation section. Direct comparisons to other corpora were not part of the original experiments but can be discussed if space permits. revision: yes

Circularity Check

0 steps flagged

No circularity: data release with empirical baseline

full rationale

The paper is a corpus release describing collection of 22200 audio samples (>31 hours) from one speaker, followed by training an end-to-end TTS network and reporting MOS scores (4.05 naturalness, 3.78 intelligibility). No equations, parameter fitting, uniqueness theorems, or self-citations appear in the provided text. The central claim is an empirical size comparison and a quality evaluation against external human raters; neither reduces to its own inputs by construction. This is the normal non-circular case for a data paper.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is a dataset release paper. The central claim rests on the existence, scale, and quality of the collected recordings rather than on any mathematical axioms, fitted parameters, or invented entities.

pith-pipeline@v0.9.0 · 5640 in / 985 out tokens · 23054 ms · 2026-05-25T15:13:32.649039+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

25 extracted references · 25 canonical work pages · 7 internal anchors

[1]

Deep Voice: Real-time Neural Text-to-Speech

Arik, S.O., Chrzanowski, M., Coates, A., Diamos, G., Gibiansky, A., Kang, Y., Li, X., Miller, J., Ng, A., Raiman, J., et al.: Deep voice: Real-time neural text-to- speech. arXiv preprint arXiv:1702.07825 (2017)

work page internal anchor Pith review Pith/arXiv arXiv 2017
[2]

Neural Machine Translation by Jointly Learning to Align and Translate

Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014)

work page internal anchor Pith review Pith/arXiv arXiv 2014
[3]

http://festvox.org /festival/

Festvox: Festvox project. http://festvox.org /festival/

work page
[4]

NIST speech disc 1-1.1

Garofolo, J.S., Lamel, L.F., Fisher, W.M., Fiscus, J.G., Pallett, D.S.: DARPA TIMIT acoustic-phonetic continous speech corpus CD-ROM. NIST speech disc 1-1.1. NASA STI/Recon technical report n (1993)

work page 1993
[5]

Summer Institute of Linguistics, Academic Publications (2018)

Gary F Simons, C.D.F.: Ethnologue: Languages of Africa and Europe, Twenty- First Edition. Summer Institute of Linguistics, Academic Publications (2018)

work page 2018
[6]

Design and recording of a high quality French database for speech synthesis

Honnet, P.E., Lazaridis, A., Garner, P.N., Yamagishi, J.: The SIWIS French speech synthesis database. Design and recording of a high quality French database for speech synthesis. Tech. rep., Idiap (2017)

work page 2017
[7]

In: LREC (2016)

Kachkovskaia, T., Kocharov, D., Skrelin, P.A., Volskaya, N.B.: CoRuSS-a New Prosodically Annotated Corpus of Russian Spontaneous Speech. In: LREC (2016)

work page 2016
[8]

Adam: A Method for Stochastic Optimization

Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)

work page internal anchor Pith review Pith/arXiv arXiv 2014
[9]

Layer Normalization

Lei Ba, J., Kiros, J.R., Hinton, G.E.: Layer normalization. arXiv preprint arXiv:1607.06450 (2016)

work page internal anchor Pith review Pith/arXiv arXiv 2016
[10]

In: Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE International Conference on

Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an ASR corpus based on public domain audio books. In: Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE International Conference on. pp. 5206–5210. IEEE (2015)

work page 2015
[11]

In: Applications of Signal Processing to Audio and Acoustics (WASPAA), 2013 IEEE Workshop on

Perraudin, N., Balazs, P., Søndergaard, P.L.: A fast Griﬃn-Lim algorithm. In: Applications of Signal Processing to Audio and Acoustics (WASPAA), 2013 IEEE Workshop on. pp. 1–4. IEEE (2013)

work page 2013
[12]

The Journal of Machine Learning Research (2018)

Ping, W., Peng, K., Gibiansky, A., Arik, S.O., Kannan, A., Narang, S., Raiman, J., Miller, J.: Deep voice 3: Scaling text-to-speech with convolutional sequence learning. The Journal of Machine Learning Research (2018)

work page 2018
[13]

In: Acoustics, Speech and Signal Pro- cessing (ICASSP), 2011 IEEE International Conference on

Ribeiro, F., Florˆ encio, D., Zhang, C., Seltzer, M.: Crowdmos: An approach for crowdsourcing mean opinion score studies. In: Acoustics, Speech and Signal Pro- cessing (ICASSP), 2011 IEEE International Conference on. pp. 2416–2419. IEEE (2011)

work page 2011
[14]

IEEE Trans

Rothauser, E.: IEEE recommended practice for speech quality measurements. IEEE Trans. on Audio and Electroacoustics pp. 225–246 (1969) RUSSIAN SPOKEN LANGUAGE CORPUS FOR SPEECH SYNTHESIS 9

work page 1969
[15]

Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions

Shen, J., Pang, R., Weiss, R.J., Schuster, M., Jaitly, N., Yang, Z., Chen, Z., Zhang, Y., Wang, Y., Skerry-Ryan, R., et al.: Natural tts synthesis by conditioning wavenet on mel spectrogram predictions. arXiv preprint arXiv:1712.05884 (2017)

work page internal anchor Pith review Pith/arXiv arXiv 2017
[16]

In: International Conference on Text, Speech and Dialogue

Skrelin, P., Volskaya, N., Kocharov, D., Evgrafova, K., Glotova, O., Evdokimova, V.: Corpres. In: International Conference on Text, Speech and Dialogue. pp. 392–

work page
[17]

In: Proceedings of International Conference on Learning Representations (ICLR) (2017)

Sotelo, J., Mehri, S., Kumar, K., Santos, J.F., Kastner, K., Courville, A., Ben- gio, Y.: Char2wav: End-to-end speech synthesis. In: Proceedings of International Conference on Learning Representations (ICLR) (2017)

work page 2017
[18]

The Journal of Machine Learning Research 15(1), 1929–1958 (2014)

Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overﬁtting. The Journal of Machine Learning Research 15(1), 1929–1958 (2014)

work page 1929
[19]

Subjective assessment of sound quality (1990)

International Telecommunication Union - Radiocommunication Sector. Subjective assessment of sound quality (1990)

work page 1990
[20]

Cambridge university press (2009)

Taylor, P.: Text-to-speech synthesis. Cambridge university press (2009)

work page 2009
[21]

http://www.m-ailabs.bayern/en/the-mailabs- speech-dataset/

The M-AILABS Speech Dataset. http://www.m-ailabs.bayern/en/the-mailabs- speech-dataset/

work page
[22]

WaveNet: A Generative Model for Raw Audio

Van Den Oord, A., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner, N., Senior, A., Kavukcuoglu, K.: Wavenet: A generative model for raw audio. CoRR abs/1609.03499 (2016)

work page internal anchor Pith review Pith/arXiv arXiv 2016
[23]

VoxForge: Voxforge.org website

work page
[24]

Tacotron: Towards End-to-End Speech Synthesis

Wang, Y., Skerry-Ryan, R., Stanton, D., Wu, Y., Weiss, R.J., Jaitly, N., Yang, Z., Xiao, Y., Chen, Z., Bengio, S., et al.: Tacotron: Towards end-to-end speech synthesis. arXiv preprint arXiv:1703.10135 (2017)

work page internal anchor Pith review Pith/arXiv arXiv 2017
[25]

In: International Conference on Speech and Computer

Yakovenko, O., Bondarenko, I., Borovikova, M., Vodolazsky, D.: Algorithms for au- tomatic accentuation and transcription of russian texts in speech recognition sys- tems. In: International Conference on Speech and Computer. pp. 768–777. Springer (2018)

work page 2018

[1] [1]

Deep Voice: Real-time Neural Text-to-Speech

Arik, S.O., Chrzanowski, M., Coates, A., Diamos, G., Gibiansky, A., Kang, Y., Li, X., Miller, J., Ng, A., Raiman, J., et al.: Deep voice: Real-time neural text-to- speech. arXiv preprint arXiv:1702.07825 (2017)

work page internal anchor Pith review Pith/arXiv arXiv 2017

[2] [2]

Neural Machine Translation by Jointly Learning to Align and Translate

Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014)

work page internal anchor Pith review Pith/arXiv arXiv 2014

[3] [3]

http://festvox.org /festival/

Festvox: Festvox project. http://festvox.org /festival/

work page

[4] [4]

NIST speech disc 1-1.1

Garofolo, J.S., Lamel, L.F., Fisher, W.M., Fiscus, J.G., Pallett, D.S.: DARPA TIMIT acoustic-phonetic continous speech corpus CD-ROM. NIST speech disc 1-1.1. NASA STI/Recon technical report n (1993)

work page 1993

[5] [5]

Summer Institute of Linguistics, Academic Publications (2018)

Gary F Simons, C.D.F.: Ethnologue: Languages of Africa and Europe, Twenty- First Edition. Summer Institute of Linguistics, Academic Publications (2018)

work page 2018

[6] [6]

Design and recording of a high quality French database for speech synthesis

Honnet, P.E., Lazaridis, A., Garner, P.N., Yamagishi, J.: The SIWIS French speech synthesis database. Design and recording of a high quality French database for speech synthesis. Tech. rep., Idiap (2017)

work page 2017

[7] [7]

In: LREC (2016)

Kachkovskaia, T., Kocharov, D., Skrelin, P.A., Volskaya, N.B.: CoRuSS-a New Prosodically Annotated Corpus of Russian Spontaneous Speech. In: LREC (2016)

work page 2016

[8] [8]

Adam: A Method for Stochastic Optimization

Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)

work page internal anchor Pith review Pith/arXiv arXiv 2014

[9] [9]

Layer Normalization

Lei Ba, J., Kiros, J.R., Hinton, G.E.: Layer normalization. arXiv preprint arXiv:1607.06450 (2016)

work page internal anchor Pith review Pith/arXiv arXiv 2016

[10] [10]

In: Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE International Conference on

Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an ASR corpus based on public domain audio books. In: Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE International Conference on. pp. 5206–5210. IEEE (2015)

work page 2015

[11] [11]

In: Applications of Signal Processing to Audio and Acoustics (WASPAA), 2013 IEEE Workshop on

Perraudin, N., Balazs, P., Søndergaard, P.L.: A fast Griﬃn-Lim algorithm. In: Applications of Signal Processing to Audio and Acoustics (WASPAA), 2013 IEEE Workshop on. pp. 1–4. IEEE (2013)

work page 2013

[12] [12]

The Journal of Machine Learning Research (2018)

Ping, W., Peng, K., Gibiansky, A., Arik, S.O., Kannan, A., Narang, S., Raiman, J., Miller, J.: Deep voice 3: Scaling text-to-speech with convolutional sequence learning. The Journal of Machine Learning Research (2018)

work page 2018

[13] [13]

In: Acoustics, Speech and Signal Pro- cessing (ICASSP), 2011 IEEE International Conference on

Ribeiro, F., Florˆ encio, D., Zhang, C., Seltzer, M.: Crowdmos: An approach for crowdsourcing mean opinion score studies. In: Acoustics, Speech and Signal Pro- cessing (ICASSP), 2011 IEEE International Conference on. pp. 2416–2419. IEEE (2011)

work page 2011

[14] [14]

IEEE Trans

Rothauser, E.: IEEE recommended practice for speech quality measurements. IEEE Trans. on Audio and Electroacoustics pp. 225–246 (1969) RUSSIAN SPOKEN LANGUAGE CORPUS FOR SPEECH SYNTHESIS 9

work page 1969

[15] [15]

Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions

Shen, J., Pang, R., Weiss, R.J., Schuster, M., Jaitly, N., Yang, Z., Chen, Z., Zhang, Y., Wang, Y., Skerry-Ryan, R., et al.: Natural tts synthesis by conditioning wavenet on mel spectrogram predictions. arXiv preprint arXiv:1712.05884 (2017)

work page internal anchor Pith review Pith/arXiv arXiv 2017

[16] [16]

In: International Conference on Text, Speech and Dialogue

Skrelin, P., Volskaya, N., Kocharov, D., Evgrafova, K., Glotova, O., Evdokimova, V.: Corpres. In: International Conference on Text, Speech and Dialogue. pp. 392–

work page

[17] [17]

In: Proceedings of International Conference on Learning Representations (ICLR) (2017)

Sotelo, J., Mehri, S., Kumar, K., Santos, J.F., Kastner, K., Courville, A., Ben- gio, Y.: Char2wav: End-to-end speech synthesis. In: Proceedings of International Conference on Learning Representations (ICLR) (2017)

work page 2017

[18] [18]

The Journal of Machine Learning Research 15(1), 1929–1958 (2014)

Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overﬁtting. The Journal of Machine Learning Research 15(1), 1929–1958 (2014)

work page 1929

[19] [19]

Subjective assessment of sound quality (1990)

International Telecommunication Union - Radiocommunication Sector. Subjective assessment of sound quality (1990)

work page 1990

[20] [20]

Cambridge university press (2009)

Taylor, P.: Text-to-speech synthesis. Cambridge university press (2009)

work page 2009

[21] [21]

http://www.m-ailabs.bayern/en/the-mailabs- speech-dataset/

The M-AILABS Speech Dataset. http://www.m-ailabs.bayern/en/the-mailabs- speech-dataset/

work page

[22] [22]

WaveNet: A Generative Model for Raw Audio

Van Den Oord, A., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner, N., Senior, A., Kavukcuoglu, K.: Wavenet: A generative model for raw audio. CoRR abs/1609.03499 (2016)

work page internal anchor Pith review Pith/arXiv arXiv 2016

[23] [23]

VoxForge: Voxforge.org website

work page

[24] [24]

Tacotron: Towards End-to-End Speech Synthesis

Wang, Y., Skerry-Ryan, R., Stanton, D., Wu, Y., Weiss, R.J., Jaitly, N., Yang, Z., Xiao, Y., Chen, Z., Bengio, S., et al.: Tacotron: Towards end-to-end speech synthesis. arXiv preprint arXiv:1703.10135 (2017)

work page internal anchor Pith review Pith/arXiv arXiv 2017

[25] [25]

In: International Conference on Speech and Computer

Yakovenko, O., Bondarenko, I., Borovikova, M., Vodolazsky, D.: Algorithms for au- tomatic accentuation and transcription of russian texts in speech recognition sys- tems. In: International Conference on Speech and Computer. pp. 768–777. Springer (2018)

work page 2018