The emotional voices database: Towards controlling the emotion dimension in voice generation systems

Adaeze Adigwe, Noé Tits, Kevin El Haddad, Sarah Ostadabbas, Thierry Dutoit · 2018 · cs.CL · arXiv 1806.09514

5 Pith papers cite this work. Polarity classification is still indexing.

abstract

In this paper, we present a database of emotional speech intended to be open-sourced and used for synthesis and generation purposes. It contains data for male and female actors in English and a male actor in French. The database covers 5 emotion classes, so it could be suitable for building synthesis and voice transformation systems with the potential to control the emotional dimension in a continuous way. We show the data's efficiency by building a simple MLP system that converts neutral to angry speech style and evaluate it via a CMOS perception test. Even though the system is a very simple one, the test shows the efficiency of the data, which is promising for future work.

representative citing papers

Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers

cs.CL · 2023-01-05 · unverdicted · novelty 7.0

VALL-E is a neural codec language model trained on 60K hours of speech that performs zero-shot TTS, synthesizing natural speech that matches an unseen speaker's voice, emotion, and environment from a 3-second prompt.

StableToken: A Noise-Robust Semantic Speech Tokenizer for Resilient SpeechLLMs

cs.CL · 2025-09-26 · unverdicted · novelty 6.0

StableToken introduces a multi-branch architecture with bit-wise voting to create noise-robust semantic speech tokens, achieving lower Unit Edit Distance and better SpeechLLM robustness than prior single-path tokenizers.

Raon-OpenTTS: Open Models and Data for Robust Text-to-Speech

eess.AS · 2026-05-20 · unverdicted · novelty 5.0

Raon-OpenTTS provides an open 510K-hour curated speech dataset and DiT-based TTS models up to 1B parameters that achieve competitive WER and speaker similarity on benchmarks versus closed models trained on millions of hours.

SEDTalker: Emotion-Aware 3D Facial Animation Using Frame-Level Speech Emotion Diarization

cs.CV · 2026-04-14 · unverdicted · novelty 4.0

SEDTalker uses frame-level speech emotion diarization to condition a hybrid Transformer-Mamba model for fine-grained, temporally continuous emotion control in 3D facial animation.

A Methodology for Controlling the Emotional Expressiveness in Synthetic Speech -- a Deep Learning approach

eess.AS · 2019-07-05 · unverdicted · novelty 3.0

A methodology is proposed for emotional text-to-speech using emotional data collection, transfer-learning-based annotation of expressiveness features, and fine-tuning of a neutral TTS model.

citing papers explorer

Showing 5 of 5 citing papers.

Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers

cs.CL · 2023-01-05 · unverdicted · none · ref 1

StableToken: A Noise-Robust Semantic Speech Tokenizer for Resilient SpeechLLMs

cs.CL · 2025-09-26 · unverdicted · none · ref 1 · internal anchor

Raon-OpenTTS: Open Models and Data for Robust Text-to-Speech

eess.AS · 2026-05-20 · unverdicted · none · ref 1 · internal anchor

SEDTalker: Emotion-Aware 3D Facial Animation Using Frame-Level Speech Emotion Diarization

cs.CV · 2026-04-14 · unverdicted · none · ref 8

A Methodology for Controlling the Emotional Expressiveness in Synthetic Speech -- a Deep Learning approach

eess.AS · 2019-07-05 · unverdicted · none · ref 27 · internal anchor
