MSceneSpeech: A Multi-Scene Speech Dataset For Expressive Speech Synthesis

Baoxing Huai; Feiyang Chen; Jialong Zuo; Mingze Li; Qian Yang; Zhefeng Wang; Zhe Su; Zhou Zhao; Ziyue Jiang

arXiv: 2407.14006 · v1 · pith:SNJNDH4Bnew · submitted 2024-07-19 · eess.AS · cs.SD

keywords: speech, mscenespeech, dataset, synthesis, audio, baseline, expressive, multiple

We introduce MSceneSpeech (Multiple Scene Speech Dataset), an open-source, high-quality Mandarin TTS dataset intended to provide resources for expressive speech synthesis. MSceneSpeech comprises numerous audio recordings and texts performed and recorded according to daily life scenarios. Each scenario includes multiple speakers and a diverse range of prosodic styles, making the dataset suitable for speech synthesis that requires multi-speaker style and prosody modeling. We also establish a robust baseline that uses a prompting mechanism to synthesize speech with both user-specific timbre and scene-specific prosody from arbitrary text input. The open-source MSceneSpeech dataset and audio samples from our baseline are available at https://speechai-demo.github.io/MSceneSpeech/.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Alethia: A Foundational Encoder for Voice Deepfakes
cs.SD 2026-04 unverdicted novelty 6.0

Alethia is a pretrained audio encoder using continuous embedding prediction and generative flow-matching reconstruction that outperforms existing speech foundation models on voice deepfake tasks with better robustness...