EmoInstruct-TTS: Dual-Path Instruction-Guided Emotional Speech Synthesis

Bingao Xu; Ganjun Liu; Hongchuan Wu; Jiasheng Chen; Jun Du; Minghui Wu; Ting Meng; Yonglong Cai; Zikun Fang

arxiv: 2606.20650 · v1 · pith:JX4FXY5Dnew · submitted 2026-06-08 · 💻 cs.CL · cs.AI · cs.SD · eess.AS


Minghui Wu, Ganjun Liu, Zikun Fang, Ting Meng, Hongchuan Wu, Bingao Xu, Yonglong Cai, Jiasheng Chen, Jun Du



keywords: emotional, emotion, speech, synthesis, dual-path, embeddings, emoinstruct-tts, explicit


Instruction-based controllable speech synthesis enables users to specify emotions through natural language. However, existing approaches often rely on coarse emotion labels and lack explicit modeling of fine-grained intensity. We propose EmoInstruct-TTS, a dual-path instruction-guided framework for emotional speech synthesis. We introduce Emotion2embed, a supervised semantic-acoustic emotion embedding covering 48 emotional states, including fine-grained categories and intensity levels. To infer embeddings from free-form instructions, we design an Instruction-Conditioned Emotion Flow Model (ICE-Flow) that generates acoustically grounded emotion representations. The inferred embeddings are integrated into an LLM-based synthesis pipeline to provide explicit emotional control while preserving semantic planning. Experiments show improved emotional controllability and speech naturalness over strong baselines.
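The abstract does not say how ICE-Flow is formulated or trained, so the following is only a generic sketch of how a conditional flow model can turn an instruction into an embedding: start from Gaussian noise and integrate a velocity field conditioned on the instruction. Everything here is an assumption made for illustration: the names `TOY_TARGETS`, `toy_velocity`, and `sample_embedding`, the 3-dimensional embedding space, and the hand-written straight-line velocity field that stands in for a learned network.

```python
import random

# Hypothetical lookup standing in for an instruction encoder: it maps an
# instruction string to a conditioning target in a toy 3-d embedding space.
TOY_TARGETS = {
    "slightly happy": [0.3, 0.0, 0.1],
    "very happy": [0.9, 0.0, 0.4],
}


def toy_velocity(x, t, target):
    """Straight-line velocity field pointing from x toward target.

    A trained flow model would replace this with a network v(x, t, c).
    """
    return [(g - xi) / (1.0 - t) for xi, g in zip(x, target)]


def sample_embedding(instruction, steps=8, seed=0):
    """Integrate dx/dt = v(x, t, c) from t=0 to t=1 with Euler steps."""
    rng = random.Random(seed)
    target = TOY_TARGETS[instruction]
    x = [rng.gauss(0.0, 1.0) for _ in target]  # start from Gaussian noise
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt  # t stays below 1, so (1 - t) is never zero
        v = toy_velocity(x, t, target)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


if __name__ == "__main__":
    print(sample_embedding("very happy"))
```

With this toy field the final Euler step lands on the conditioning target, so two instructions that differ only in intensity end at different points of the embedding space. In a full pipeline of the kind the abstract describes, such an embedding would then condition the synthesis model.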
