50.5 Hours — English (America) Children Scripted Monologue Microphone Speech Dataset
1 Pith paper cites this work. Polarity classification is still indexing.
Fields: eess.AS (1)
Years: 2025 (1)
Verdicts: UNVERDICTED (1)
Representative citing papers
Do You Hear What I Mean? Quantifying the Instruction-Perception Gap in Instruction-Guided Expressive Text-To-Speech Systems
Introduces the E-VOC corpus and shows that five ITTS systems, including the best-performing gpt-4o-mini-tts, still default to adult voices and struggle with fine-grained expressive control.