A two-stage static-then-dynamic prompt selection strategy using prosodic features, LLM coherence scores, and similarity metrics improves emotion intensity and speaker consistency in zero-shot TTS.
VioLA: Conditional language models for speech recognition, synthesis, and translation,
1 Pith paper cite this work. Polarity classification is still indexing.
1
Pith paper citing it
fields
cs.SD 1years
2024 1verdicts
UNVERDICTED 1representative citing papers
citing papers explorer
-
Expressive Prompting: Improving Emotion Intensity and Speaker Consistency in Zero-Shot TTS
A two-stage static-then-dynamic prompt selection strategy using prosodic features, LLM coherence scores, and similarity metrics improves emotion intensity and speaker consistency in zero-shot TTS.