arxiv: 2407.14006 · v1 · pith:SNJNDH4Bnew · submitted 2024-07-19 · 📡 eess.AS · cs.SD

MSceneSpeech: A Multi-Scene Speech Dataset For Expressive Speech Synthesis

keywords speech mscenespeech dataset synthesis audio baseline expressive multiple

We introduce MSceneSpeech (Multiple Scene Speech Dataset), an open-source, high-quality Mandarin TTS dataset intended to provide resources for expressive speech synthesis. MSceneSpeech comprises numerous audio recordings and texts performed and recorded according to daily life scenarios. Each scenario includes multiple speakers and a diverse range of prosodic styles, making it suitable for speech synthesis that requires multi-speaker style and prosody modeling. Using a prompting mechanism, we establish a robust baseline that can synthesize speech with both user-specific timbre and scene-specific prosody from arbitrary text input. The open-source MSceneSpeech dataset and audio samples from our baseline are available at https://speechai-demo.github.io/MSceneSpeech/.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Alethia: A Foundational Encoder for Voice Deepfakes

    cs.SD 2026-04 unverdicted novelty 6.0

    Alethia is a pretrained audio encoder using continuous embedding prediction and generative flow-matching reconstruction that outperforms existing speech foundation models on voice deepfake tasks with better robustness...