arxiv: 2606.20650 · v1 · pith:JX4FXY5Dnew · submitted 2026-06-08 · 💻 cs.CL · cs.AI · cs.SD · eess.AS

EmoInstruct-TTS: Dual-Path Instruction-Guided Emotional Speech Synthesis

classification 💻 cs.CL cs.AI cs.SD eess.AS
keywords emotional, emotion, speech, synthesis, dual-path, embeddings, emoinstruct-tts, explicit

Instruction-based controllable speech synthesis enables users to specify emotions through natural language. However, existing approaches often rely on coarse emotion labels and lack explicit modeling of fine-grained intensity. We propose EmoInstruct-TTS, a dual-path instruction-guided framework for emotional speech synthesis. We introduce Emotion2embed, a supervised semantic-acoustic emotion embedding covering 48 emotional states, including fine-grained categories and intensity levels. To infer embeddings from free-form instructions, we design an Instruction-Conditioned Emotion Flow Model (ICE-Flow) that generates acoustically grounded emotion representations. The inferred embeddings are integrated into an LLM-based synthesis pipeline to provide explicit emotional control while preserving semantic planning. Experiments show improved emotional controllability and speech naturalness over strong baselines.
