End-to-End Emotional Speech Synthesis Using Style Tokens and Semi-Supervised Training

Hong-chuan Wu; Li-Juan Liu; Li-Rong Dai; Peng-fei Wu; Yuan Jiang; Zhen-Hua Ling

arxiv: 1906.10859 · v1 · pith:666Y3YKInew · submitted 2019-06-26 · 📡 eess.AS · cs.SD

Peng-fei Wu, Zhen-Hua Ling, Li-Juan Liu, Yuan Jiang, Hong-chuan Wu, Li-Rong Dai

Pith reviewed 2026-05-25 15:30 UTC · model grok-4.3

keywords: emotional speech synthesis · global style tokens · semi-supervised training · Tacotron · cross-entropy loss · emotion categories · interpretability · end-to-end synthesis

The pith

Semi-supervised GST-Tacotron learns one-to-one emotion mappings from style tokens with only 5% labeled data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes an end-to-end emotional speech synthesis method built on the GST-Tacotron framework that uses global style tokens to represent emotion categories. A cross-entropy loss is applied between the token weights and the emotion labels available in a small fraction of the data to enforce interpretability. This semi-supervised approach enables the model to produce emotional speech without requiring emotion labels on most training examples. Objective and subjective tests show the resulting model exceeds standard Tacotron performance at the 5% label level and reaches subjective quality close to a fully labeled Tacotron model.

Core claim

The GST-Tacotron model augmented with a cross-entropy loss between style token weights and emotion labels achieves one-to-one correspondence between style tokens and emotion categories, outperforming the conventional Tacotron model for emotional speech synthesis when only 5% of training data has emotion labels and attaining subjective performance close to the Tacotron model trained using all emotion labels.

What carries the argument

Global style tokens (GSTs) whose weights receive an auxiliary cross-entropy loss against a small set of emotion labels to enforce category-specific interpretability.
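
The mechanism can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's implementation: it assumes each utterance yields a scalar synthesis loss and a normalized vector of token weights, that unlabeled utterances carry `None` as their label, that emotion category k is tied to token k, and that the auxiliary term is averaged over the labeled utterances and scaled by a hypothetical coefficient `ce_weight`. All function and parameter names are illustrative.

```python
import math


def token_cross_entropy(token_weights, label, eps=1e-12):
    """Cross-entropy between a one-hot emotion label and the token weights.

    token_weights: non-negative floats summing to 1, one per style token.
    label: emotion index, assumed here to equal the index of its token.
    """
    return -math.log(token_weights[label] + eps)


def semi_supervised_loss(synthesis_losses, token_weights_batch, labels,
                         ce_weight=1.0):
    """Mean synthesis loss over all utterances plus a masked auxiliary CE.

    labels[i] is an int for labeled utterances and None for unlabeled ones,
    so only the labeled subset contributes to the cross-entropy term.
    ce_weight is a hypothetical mixing coefficient, not taken from the paper.
    """
    recon = sum(synthesis_losses) / len(synthesis_losses)
    ce_terms = [
        token_cross_entropy(weights, label)
        for weights, label in zip(token_weights_batch, labels)
        if label is not None
    ]
    aux = sum(ce_terms) / len(ce_terms) if ce_terms else 0.0
    return recon + ce_weight * aux
```

In a batch with no labeled utterances the auxiliary term vanishes and the tokens are shaped by the synthesis loss alone, which is why the generalization of the token-to-emotion mapping to unlabeled data matters.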

If this is right

Style tokens acquire direct one-to-one correspondence with distinct emotion categories.
The model achieves higher objective and subjective quality than standard Tacotron when only 5% of data carries emotion labels.
Subjective performance approaches that of a Tacotron model trained on 100% emotion labels.
Emotion recognition experiments on the learned tokens confirm the alignment between tokens and categories.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same token-weight loss construction could be reused to control other non-emotion speech attributes such as speaker identity or speaking rate with limited supervision.
The reduction in required labeled data may allow emotional synthesis systems to be built from existing unlabeled corpora augmented by a small annotation effort.
Testing the stability of the learned token-to-emotion mapping across new speakers or recording conditions would reveal whether the correspondence holds outside the training distribution.

Load-bearing premise

The cross-entropy loss between token weights and the small set of emotion labels will produce a reliable one-to-one mapping from style tokens to emotion categories that generalizes beyond the labeled subset.

What would settle it

If an independent emotion classifier applied to utterances generated from each individual style token fails to recover the intended emotion category for most tokens, the claimed one-to-one correspondence would be refuted.
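
A check of this kind could be sketched as follows. This is only an illustration: it assumes a square count matrix `confusion[t][e]` giving how many utterances synthesized from token `t` were classified as emotion `e` by an independent recognizer, an intended mapping of token `t` to emotion `t`, and a hypothetical acceptance threshold `min_accuracy`. None of these come from the paper.

```python
def dominant_emotions(confusion):
    """For each token (row), return the emotion index predicted most often."""
    return [max(range(len(row)), key=lambda e: row[e]) for row in confusion]


def supports_one_to_one(confusion, min_accuracy=0.5):
    """Return True if every token is recognized mainly as its intended emotion.

    confusion[t][e]: count of utterances from token t classified as emotion e.
    The intended mapping token t -> emotion t and the threshold are assumptions.
    """
    for t, row in enumerate(confusion):
        total = sum(row)
        if total == 0:
            return False  # no evidence for this token
        dominant = max(range(len(row)), key=lambda e: row[e])
        if dominant != t:
            return False  # token is mostly heard as a different emotion
        if row[t] / total < min_accuracy:
            return False  # intended emotion wins, but too weakly
    return True
```

With made-up toy counts, `supports_one_to_one([[8, 1, 1], [0, 9, 1], [2, 1, 7]])` passes, whereas a matrix in which two tokens are both heard mostly as the same emotion fails.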

Figures

Figures reproduced from arXiv: 1906.10859 by Hong-chuan Wu, Li-Juan Liu, Li-Rong Dai, Peng-fei Wu, Yuan Jiang, Zhen-Hua Ling.

Original abstract

This paper proposes an end-to-end emotional speech synthesis (ESS) method which adopts global style tokens (GSTs) for semi-supervised training. This model is built based on the GST-Tacotron framework. The style tokens are defined to present emotion categories. A cross entropy loss function between token weights and emotion labels is designed to obtain the interpretability of style tokens utilizing the small portion of training data with emotion labels. Emotion recognition experiments confirm that this method can achieve one-to-one correspondence between style tokens and emotion categories effectively. Objective and subjective evaluation results show that our model outperforms the conventional Tacotron model for ESS when only 5% of training data has emotion labels. Its subjective performance is close to the Tacotron model trained using all emotion labels.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper adds a cross-entropy term on 5% emotion labels to GST-Tacotron and reports the resulting model beats plain Tacotron while nearing the fully supervised baseline.

The core move here is straightforward: take the existing GST-Tacotron backbone, assign style tokens to emotion categories, and add a cross-entropy loss between token weights and the available labels on the small labeled slice. This is presented as the way to get interpretability without labeling everything. They then run emotion recognition to check that the tokens line up with categories and compare synthesis quality against Tacotron baselines with objective metrics plus listening tests. The headline result is that 5% labels suffice to get most of the way to the all-labels performance while beating a conventional Tacotron trained with the same 5% of labels. That combination is the actual new piece; the rest builds directly on prior GST work.

The experiments are the part that earns credit: they include both objective and subjective measures plus the recognition verification step, which is more than many similar papers bother with. The practical angle also lands: if labeling is expensive, showing usable results from mostly unlabeled data is worth knowing.

The soft spots sit in the reporting and the mapping claim. The abstract gives no numbers, no dataset sizes, no split details, and no significance tests, so it is impossible to judge how close “close” actually is or whether the gains are reliable. The stress-test point about the cross-entropy only acting on the labeled 5% is fair; nothing in the architecture itself stops the tokens from picking up speaker or recording artifacts on the unlabeled majority, and the recognition experiments would need to be tight enough to rule that out. If those experiments are clean and held-out, the central claim stands; if they are not, the one-to-one correspondence is weaker than stated.

This is for TTS groups that care about low-resource emotional control and already know the GST-Tacotron literature. A reader who wants a concrete semi-supervised recipe with listening-test backing will find it useful.
It deserves peer review because the method is clear, the experiments exist, and the motivation is practical even if the numbers need closer inspection in the full text.

Referee Report

3 major / 2 minor

Summary. The paper proposes an end-to-end emotional speech synthesis (ESS) model extending the GST-Tacotron framework with semi-supervised training. Style tokens are trained to represent emotion categories via a cross-entropy loss applied to token weights using only 5% of the training data with emotion labels; the remaining 95% is unlabeled. Emotion recognition experiments are used to verify one-to-one token-to-emotion mapping. The central claim is that the resulting model outperforms a conventional Tacotron trained with 5% emotion labels and approaches the subjective performance of a Tacotron trained with 100% labels, supported by objective metrics, subjective listening tests, and the recognition verification.

Significance. If the mapping generalizes, the work would show a practical route to high-quality emotional TTS with minimal labeled data by leveraging GSTs and limited supervision. The combination of objective metrics, subjective tests, and an explicit emotion-recognition verification step is a methodological strength that supports interpretability claims. This could lower barriers to emotional speech synthesis in low-resource settings.

major comments (3)

[Abstract] Abstract: the headline claim that the model 'outperforms the conventional Tacotron model for ESS when only 5% of training data has emotion labels' and is 'close to the Tacotron model trained using all emotion labels' is stated without any numerical values for the objective metrics, MOS scores, dataset sizes, or statistical tests. This absence directly limits verification of the central performance claim.
[Emotion recognition experiments] Emotion recognition experiments (described in the abstract): the assertion of 'one-to-one correspondence between style tokens and emotion categories' rests on these experiments, yet no held-out set size, baseline accuracy, confusion matrix, or statistical significance is reported. Without this information it is impossible to confirm that the cross-entropy term on the 5% labeled subset produces a mapping that generalizes to the 95% unlabeled utterances rather than capturing speaker or recording artifacts.
[Method] Method section (cross-entropy loss definition): the auxiliary loss is computed exclusively on the labeled 5% subset; no analysis, ablation, or regularization term is described that would prevent the GST weights from encoding non-emotion factors on the unlabeled majority. This is load-bearing for the semi-supervised generalization claim.

minor comments (2)

[Abstract] The abstract would be clearer if it included at least the key numerical results (e.g., MCD or MOS deltas) alongside the qualitative statements.
[Method] Notation for the combined loss (Tacotron loss + cross-entropy) should be explicitly written as an equation for reproducibility.
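
For reference, the objective measures named in the first minor comment can be computed along the following lines. This is a sketch under assumptions rather than the paper's evaluation code: it assumes reference and synthesized frames are already time-aligned (for example by dynamic time warping), that the 0th (energy) cepstral coefficient has been dropped beforehand, and that unvoiced frames are marked with an F0 of 0. The function names are illustrative.

```python
import math


def mel_cepstral_distortion(ref_frames, syn_frames):
    """Mean mel-cepstral distortion in dB over time-aligned frames.

    Each frame is a list of mel-cepstral coefficients (energy term excluded).
    Per frame: (10 / ln 10) * sqrt(2 * sum_d (c_d - c'_d)^2).
    """
    if not ref_frames or len(ref_frames) != len(syn_frames):
        raise ValueError("frame sequences must be non-empty and equally long")
    const = 10.0 / math.log(10.0)
    total = 0.0
    for ref, syn in zip(ref_frames, syn_frames):
        squared = sum((a - b) ** 2 for a, b in zip(ref, syn))
        total += const * math.sqrt(2.0 * squared)
    return total / len(ref_frames)


def f0_rmse(ref_f0, syn_f0):
    """RMSE in Hz over frames that are voiced (F0 > 0) in both sequences."""
    pairs = [(a, b) for a, b in zip(ref_f0, syn_f0) if a > 0 and b > 0]
    if not pairs:
        raise ValueError("no frames are voiced in both sequences")
    return math.sqrt(sum((a - b) ** 2 for a, b in pairs) / len(pairs))
```

Reporting differences in these values between systems, together with listening-test scores, is what would make the "outperforms" and "close to" statements checkable.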

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below and will revise the manuscript accordingly to improve verifiability.

Referee: [Abstract] the headline claim that the model 'outperforms the conventional Tacotron model for ESS when only 5% of training data has emotion labels' and is 'close to the Tacotron model trained using all emotion labels' is stated without any numerical values for the objective metrics, MOS scores, dataset sizes, or statistical tests. This absence directly limits verification of the central performance claim.

Authors: We agree that including concrete numbers strengthens the abstract. The revised version will report key objective metrics (MCD, F0 RMSE), MOS scores, dataset size (utterances/hours), and note statistical significance of the improvements over the 5%-label Tacotron baseline and proximity to the 100%-label model. revision: yes
Referee: [Emotion recognition experiments] the assertion of 'one-to-one correspondence between style tokens and emotion categories' rests on these experiments, yet no held-out set size, baseline accuracy, confusion matrix, or statistical significance is reported. Without this information it is impossible to confirm that the cross-entropy term on the 5% labeled subset produces a mapping that generalizes to the 95% unlabeled utterances rather than capturing speaker or recording artifacts.

Authors: We will augment the emotion-recognition section with the held-out test-set size, baseline recognizer accuracy, full confusion matrix, and statistical significance. These additions will allow readers to assess whether the learned mapping generalizes beyond the labeled subset. revision: yes
Referee: [Method] the auxiliary loss is computed exclusively on the labeled 5% subset; no analysis, ablation, or regularization term is described that would prevent the GST weights from encoding non-emotion factors on the unlabeled majority. This is load-bearing for the semi-supervised generalization claim.

Authors: The cross-entropy term supplies explicit emotion supervision on the labeled data while GSTs are jointly optimized on all utterances. In revision we will add an analysis of token-weight distributions on unlabeled data and an ablation comparing models trained with versus without the auxiliary loss to quantify its role in preventing non-emotion encoding. revision: partial
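
The token-weight analysis promised in the last response could take roughly this shape. It is a hedged sketch, not the authors' procedure: it assumes each unlabeled utterance yields a normalized token-weight vector, and it summarizes how peaked those vectors are (Shannon entropy) and which token dominates each utterance. The function names are hypothetical.

```python
import math
from collections import Counter


def weight_entropy(weights):
    """Shannon entropy in nats of one utterance's token-weight vector."""
    return -sum(w * math.log(w) for w in weights if w > 0)


def summarize_unlabeled(token_weights_batch):
    """Return (mean entropy, histogram of dominant token indices).

    Low mean entropy means the weights are close to one-hot, i.e. each
    unlabeled utterance is assigned mostly to a single token.
    """
    entropies = [weight_entropy(w) for w in token_weights_batch]
    dominant = Counter(
        max(range(len(w)), key=lambda k: w[k]) for w in token_weights_batch
    )
    return sum(entropies) / len(entropies), dict(dominant)
```

Near-one-hot weights on unlabeled data would be consistent with the claimed mapping but would not prove it; a dominant-token histogram that tracks speaker or recording session rather than emotion would point to the non-emotion encoding the referee worries about.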

Circularity Check

0 steps flagged

No significant circularity detected

The paper's central result is an empirical performance comparison: a GST-Tacotron variant trained with an auxiliary cross-entropy term on 5% emotion labels outperforms a standard Tacotron baseline (and approaches the fully-supervised case) on held-out objective and subjective metrics. This does not reduce to any of the enumerated circular patterns; the loss term is applied only to the labeled slice while synthesis quality is measured separately on unseen data. No self-citation chain, fitted parameter renamed as prediction, or self-definitional mapping is load-bearing for the reported gains. The emotion-recognition confirmation is presented as auxiliary validation rather than the derivation itself.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central performance claim rests on the domain assumption that style tokens can be forced into emotion categories by a supervised loss on a small labeled fraction and that this mapping transfers to the unlabeled majority.

axioms (1)

domain assumption: Style tokens can be made to correspond one-to-one with emotion categories via cross-entropy loss on a small labeled subset
This premise is invoked to justify the interpretability experiment and the semi-supervised training procedure.
