Controllable Emphasis with zero data for text-to-speech

Alessandro Lombardi; Alexis Moinet; Aman Hussain; Ammar Abbas; Antonio Bonafonte; Arent van Korlaar; Arnaud Joly; Ekaterina Peterova; Elena Sokolova; Marco Nicolis

arxiv: 2307.07062 · v1 · pith:NEDDRPFNnew · submitted 2023-07-13 · 📡 eess.AS · cs.LG · cs.SD

Arnaud Joly, Marco Nicolis, Ekaterina Peterova, Alessandro Lombardi, Ammar Abbas, Arent van Korlaar, Aman Hussain, Parul Sharma, Alexis Moinet, Mateusz Lajszczak, Penny Karanasou, Antonio Bonafonte, Thomas Drugman, Elena Sokolova

classification 📡 eess.AS cs.LG cs.SD

keywords method, duration, emphasis, emphasized, recordings, require, scalable, significantly

We present a scalable method to produce high-quality emphasis for text-to-speech (TTS) that does not require recordings or annotations. Many TTS models include a phoneme duration model. A simple but effective method to achieve emphasized speech consists of increasing the predicted duration of the emphasized word. We show that this is significantly better than spectrogram modification techniques, improving naturalness by $7.3\%$ and testers' correct identification of the emphasized word in a sentence by $40\%$ on a reference female en-US voice. We show that this technique significantly closes the gap to methods that require explicit recordings. The method proved to be scalable and was preferred in all four languages tested (English, Spanish, Italian, German), for different voices and multiple speaking styles.
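
As a rough illustration of the duration-based idea in the abstract, the sketch below lengthens the predicted phoneme durations of one word before they would be passed on to the rest of a TTS pipeline. It is not the authors' implementation: the function name `emphasize_word`, the `word_index` alignment, the frame values, and the scaling factor of 1.5 are all assumptions chosen for illustration, since the abstract does not state how much the duration is increased.

```python
from typing import List, Sequence


def emphasize_word(
    durations: Sequence[float],
    word_index: Sequence[int],
    target_word: int,
    factor: float = 1.5,  # assumed value; the abstract gives no specific factor
) -> List[float]:
    """Return a copy of `durations` with the target word's phonemes lengthened.

    durations:   predicted duration of each phoneme (e.g. in frames)
    word_index:  for each phoneme, the index of the word it belongs to
    target_word: index of the word to emphasize
    """
    if len(durations) != len(word_index):
        raise ValueError("durations and word_index must have the same length")
    if factor <= 0:
        raise ValueError("factor must be positive")
    # Scale only the phonemes that belong to the emphasized word.
    return [
        d * factor if w == target_word else d
        for d, w in zip(durations, word_index)
    ]


# Hypothetical duration-model output for a three-word utterance.
durations = [4.0, 6.0, 5.0, 8.0, 3.0, 7.0]
word_index = [0, 0, 1, 1, 1, 2]
emphasized = emphasize_word(durations, word_index, target_word=1, factor=1.5)
```

Because the change is applied to the duration model's output rather than to recordings or labels, a sketch like this needs no emphasis-specific training data, which matches the "zero data" framing of the title.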
