The Emotional Voices Database: Towards Controlling the Emotion Dimension in Voice Generation Systems

Adaeze Adigwe; Kevin El Haddad; Noé Tits; Sarah Ostadabbas; Thierry Dutoit

arxiv: 1806.09514 · v1 · pith:SEF73PGCnew · submitted 2018-06-25 · 💻 cs.CL · cs.AI · eess.AS




keywords: data, database, emotional, dimension, efficiency, emotion, generation, male


abstract

In this paper, we present a database of emotional speech intended to be open-sourced and used for speech synthesis and generation. It contains data from male and female actors in English and from a male actor in French. The database covers five emotion classes, so it could be suitable for building synthesis and voice transformation systems that can control the emotional dimension continuously. We show the data's effectiveness by building a simple multilayer perceptron (MLP) system that converts neutral speech into an angry speaking style, and we evaluate it with a comparative mean opinion score (CMOS) perception test. Even though the system is very simple, the test shows that the data are effective, which is promising for future work.
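The abstract only states that a simple MLP converts neutral speech to an angry style and that the result is judged with a CMOS test. The sketch below illustrates that kind of pipeline under loudly stated assumptions that are not the authors' settings: parallel neutral/angry utterances are assumed to be already time-aligned frame by frame (for example with dynamic time warping) into fixed-size acoustic feature vectors; the hypothetical helpers `train_mlp`, `convert`, and `cmos_summary`, the layer size, learning rate, and the -3..+3 CMOS scale are illustrative choices. Synthetic arrays stand in for real features.

```python
import numpy as np


def train_mlp(x, y, hidden=32, lr=0.1, epochs=200, seed=0):
    """Fit a one-hidden-layer tanh MLP mapping source frames x (n, d_in)
    to target frames y (n, d_out) by full-batch gradient descent on MSE.
    Hypothetical helper; sizes and learning rate are illustrative only."""
    rng = np.random.default_rng(seed)
    d_in, d_out = x.shape[1], y.shape[1]
    w1 = rng.normal(0.0, 1.0 / np.sqrt(d_in), (d_in, hidden))
    b1 = np.zeros(hidden)
    w2 = rng.normal(0.0, 1.0 / np.sqrt(hidden), (hidden, d_out))
    b2 = np.zeros(d_out)
    losses = []
    for _ in range(epochs):
        # Forward pass: tanh hidden layer, linear output layer
        h = np.tanh(x @ w1 + b1)
        err = h @ w2 + b2 - y
        losses.append(float(np.mean(err ** 2)))
        # Backward pass: gradient of the mean squared error
        g_out = 2.0 * err / err.size
        g_w2, g_b2 = h.T @ g_out, g_out.sum(axis=0)
        g_pre = (g_out @ w2.T) * (1.0 - h ** 2)  # tanh'(z) = 1 - tanh(z)^2
        g_w1, g_b1 = x.T @ g_pre, g_pre.sum(axis=0)
        # Plain gradient-descent update
        w1 -= lr * g_w1
        b1 -= lr * g_b1
        w2 -= lr * g_w2
        b2 -= lr * g_b2
    return (w1, b1, w2, b2), losses


def convert(params, x):
    """Map (assumed aligned) neutral-style frames to predicted angry-style frames."""
    w1, b1, w2, b2 = params
    return np.tanh(x @ w1 + b1) @ w2 + b2


def cmos_summary(scores):
    """Mean CMOS score with a normal-approximation 95% confidence interval.
    Each score is a listener's comparison of a converted sample against a
    reference, assumed here to lie on a -3..+3 scale."""
    s = np.asarray(scores, dtype=float)
    mean = s.mean()
    half = 1.96 * s.std(ddof=1) / np.sqrt(len(s))
    return mean, (mean - half, mean + half)


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    # Synthetic stand-ins for aligned neutral (x) and angry (y) feature frames
    x = rng.normal(size=(200, 10))
    y = np.tanh(x @ rng.normal(size=(10, 10)))
    params, losses = train_mlp(x, y)
    print(f"training MSE: {losses[0]:.3f} -> {losses[-1]:.3f}")
    print(cmos_summary([1, 2, 0, 1, 1]))
```

A frame-wise regression like this ignores temporal context, which is one reason such a system counts as "very simple"; the CMOS summary just averages the paired listener ratings, so a mean above zero would indicate listeners preferred (or, depending on the question asked, perceived more anger in) the converted samples.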



Forward citations

Cited by 5 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers
cs.CL 2023-01 unverdicted novelty 7.0

VALL-E is a neural codec language model trained on 60K hours of speech that performs zero-shot TTS, synthesizing natural speech that matches an unseen speaker's voice, emotion, and environment from a 3-second prompt.

StableToken: A Noise-Robust Semantic Speech Tokenizer for Resilient SpeechLLMs
cs.CL 2025-09 unverdicted novelty 6.0

StableToken introduces a multi-branch architecture with bit-wise voting to create noise-robust semantic speech tokens, achieving lower Unit Edit Distance and better SpeechLLM robustness than prior single-path tokenizers.

Raon-OpenTTS: Open Models and Data for Robust Text-to-Speech
eess.AS 2026-05 unverdicted novelty 5.0

Raon-OpenTTS provides an open 510K-hour curated speech dataset and DiT-based TTS models up to 1B parameters that achieve competitive WER and speaker similarity on benchmarks versus closed models trained on millions of hours.

SEDTalker: Emotion-Aware 3D Facial Animation Using Frame-Level Speech Emotion Diarization
cs.CV 2026-04 unverdicted novelty 4.0

SEDTalker uses frame-level speech emotion diarization to condition a hybrid Transformer-Mamba model for fine-grained, temporally continuous emotion control in 3D facial animation.

A Methodology for Controlling the Emotional Expressiveness in Synthetic Speech -- a Deep Learning approach
eess.AS 2019-07 unverdicted novelty 3.0

A methodology is proposed for emotional text-to-speech using emotional data collection, transfer-learning-based annotation of expressiveness features, and fine-tuning of a neutral TTS model.