Do You Hear What I Mean? Quantifying the Instruction-Perception Gap in Instruction-Guided Expressive Text-To-Speech Systems
Pith reviewed 2026-05-18 16:34 UTC · model grok-4.3
The pith
Instruction-guided TTS systems show a clear gap between user prompts and how listeners perceive the generated speech.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors establish through human evaluations that a substantial instruction-perception gap exists in ITTS. Of the five systems tested, gpt-4o-mini-tts shows the strongest alignment across acoustic dimensions; all five tend to generate adult voices even when instructed to produce child or elderly ones; and fine-grained control over slightly different attribute instructions remains difficult.
What carries the argument
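The E-VOC corpus of human ratings on generated utterances, covering two expressive dimensions (adverbs of degree and graded emotion intensity) plus speaker age and word-level emphasis, which is used to quantify how well listener perception matches the instruction.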
If this is right
- Users will frequently receive speech outputs that do not match their intended style instructions.
- Current ITTS models have limited ability to interpret and apply small differences in attribute instructions.
- Voice age control is especially unreliable and tends to default to adult characteristics across systems.
- The E-VOC corpus can serve as a benchmark for measuring progress in instruction following for future models.
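The age-control implication can be made concrete with a small tally. The following is a minimal sketch, assuming each listener judgment is reduced to an (instructed age, perceived age) pair. The data and the names `ratings` and `age_alignment` are hypothetical and are not taken from the E-VOC corpus or the paper's metrics.

```python
from collections import Counter, defaultdict

# Hypothetical (instructed_age, perceived_age) pairs; NOT data from the paper.
ratings = [
    ("child", "adult"), ("child", "child"), ("child", "adult"),
    ("adult", "adult"), ("adult", "adult"), ("adult", "adult"),
    ("elderly", "adult"), ("elderly", "elderly"), ("elderly", "adult"),
]


def age_alignment(pairs):
    """Return (per-instruction match rate, adult-default rate)."""
    by_instruction = defaultdict(Counter)
    for instructed, perceived in pairs:
        by_instruction[instructed][perceived] += 1

    # Fraction of judgments where the perceived age equals the instructed age.
    match_rate = {
        instructed: counts[instructed] / sum(counts.values())
        for instructed, counts in by_instruction.items()
    }

    # Share of non-adult instructions that were nevertheless heard as adult.
    non_adult = [(i, p) for i, p in pairs if i != "adult"]
    adult_default = sum(p == "adult" for _, p in non_adult) / len(non_adult)
    return match_rate, adult_default


match_rate, adult_default = age_alignment(ratings)
print(match_rate)
print(f"adult-default rate: {adult_default:.2f}")
```

A high adult-default rate alongside low match rates for "child" and "elderly" is the pattern the third bullet describes.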
Where Pith is reading between the lines
- Training datasets for these models likely under-represent child and elderly voices in response to explicit instructions.
- Similar perception gaps may appear in other instruction-guided generative systems such as image or music generation.
- Developers could create automatic predictors of the gap by correlating the human ratings with objective acoustic measures.
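As an illustration of the last point, the sketch below computes a Spearman rank correlation between an objective acoustic measure and mean human ratings. It is a minimal example under stated assumptions: the loudness values (in LUFS) and intensity ratings are invented, and the helper names `average_ranks`, `pearson`, and `spearman` are hypothetical. Neither the paper nor Pith specifies this procedure.

```python
from statistics import mean


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; assumes neither input is constant."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation = Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical per-utterance values; NOT data from the paper.
loudness_lufs = [-30.0, -27.5, -24.0, -23.5, -20.0, -18.0]
perceived_intensity = [1.2, 1.9, 2.4, 2.2, 3.8, 4.5]

rho = spearman(loudness_lufs, perceived_intensity)
print(f"Spearman rho: {rho:.3f}")
```

A rank correlation is used here because listener ratings are ordinal; whether any single acoustic measure tracks perception well enough to predict the gap is an open question that the corpus itself would have to answer.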
Load-bearing premise
Human listener ratings in the E-VOC corpus accurately capture the true instruction-perception gap without significant bias from rating scale interpretation or listener demographics.
What would settle it
Repeating the listening tests on the same generated utterances with listeners from substantially different demographic groups and obtaining markedly different ratings for age or emphasis attributes would undermine the gap measurements.
Original abstract
Instruction-guided text-to-speech (ITTS) enables users to control speech generation through natural language prompts, offering a more intuitive interface than traditional TTS. However, the alignment between user style instructions and listener perception remains largely unexplored. This work first presents a perceptual analysis of ITTS controllability across two expressive dimensions (adverbs of degree and graded emotion intensity) and collects human ratings on speaker age and word-level emphasis attributes. To comprehensively reveal the instruction-perception gap, we provide a data collection with large-scale human evaluations, named Expressive VOice Control (E-VOC) corpus. Furthermore, we reveal that (1) gpt-4o-mini-tts is the most reliable ITTS model with great alignment between instruction and generated utterances across acoustic dimensions. (2) The 5 analyzed ITTS systems tend to generate Adult voices even when the instructions ask to use child or Elderly voices. (3) Fine-grained control remains a major challenge, indicating that most ITTS systems have substantial room for improvement in interpreting slightly different attribute instructions.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript examines the instruction-perception gap in instruction-guided expressive text-to-speech (ITTS) systems across five models. It introduces the Expressive VOice Control (E-VOC) corpus collected via large-scale human evaluations on expressive dimensions including adverbs of degree, graded emotion intensity, speaker age, and word-level emphasis. The central claims are that gpt-4o-mini-tts exhibits the strongest alignment between instructions and listener perceptions, that the tested systems tend to default to adult voices despite contrary instructions, and that fine-grained control over slightly differing attributes remains challenging for current ITTS systems.
Significance. If the human ratings prove reliable, the work supplies a new evaluation corpus and concrete evidence of controllability limitations in ITTS, which could serve as a benchmark for improving intuitive expressive control in speech synthesis applications.
major comments (1)
- [E-VOC corpus description and human evaluation protocol] The headline conclusion that gpt-4o-mini-tts is the most reliable ITTS model with best instruction-perception alignment, along with the adult-voice bias finding, rests entirely on listener ratings from the E-VOC corpus. The manuscript provides no report of inter-rater agreement (Fleiss' kappa, ICC, or equivalent), rater screening criteria, demographic balancing, or tests confirming consistent scale interpretation across the two expressive dimensions and the age/emphasis attributes. This methodological gap directly undermines confidence in the system rankings and the three numbered findings.
minor comments (1)
- [Abstract] The abstract states that ratings were collected on 'speaker age and word-level emphasis attributes' but does not specify the exact rating scale (5-point, continuous, etc.) or the number of utterances and raters per condition; adding these numbers would improve readability.
Simulated Author's Rebuttal
We thank the referee for their thorough review and valuable comments on our manuscript. We have carefully considered the concerns regarding the E-VOC corpus and human evaluation protocol. Below, we provide a point-by-point response and outline the revisions we will make to address these issues.
- Referee: The headline conclusion that gpt-4o-mini-tts is the most reliable ITTS model with best instruction-perception alignment, along with the adult-voice bias finding, rests entirely on listener ratings from the E-VOC corpus. The manuscript provides no report of inter-rater agreement (Fleiss' kappa, ICC, or equivalent), rater screening criteria, demographic balancing, or tests confirming consistent scale interpretation across the two expressive dimensions and the age/emphasis attributes. This methodological gap directly undermines confidence in the system rankings and the three numbered findings.
  Authors: We agree that providing details on inter-rater agreement and rater demographics is crucial for validating the human evaluation results. Although the current manuscript focuses on the corpus introduction and findings, we did collect the necessary data during the large-scale evaluation. In the revised version, we will include: (1) inter-rater agreement metrics such as Fleiss' kappa for the ratings on adverbs of degree, emotion intensity, age, and emphasis; (2) rater screening criteria, including requirements for English proficiency and headphone use; (3) available demographic information; and (4) any additional checks for consistent scale use across attributes. These additions will bolster confidence in the reported results, including the superior performance of gpt-4o-mini-tts and the identified biases and challenges. We do not anticipate that these details will change our main conclusions, but they will enhance the manuscript's rigor.
  Revision: yes
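For readers unfamiliar with the agreement statistic named in this exchange, the following is a minimal sketch of Fleiss' kappa for categorical ratings such as perceived age. The count table is hypothetical and the function name `fleiss_kappa` is an illustrative choice; the authors' actual analysis is not described in the source.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for a table where counts[i][j] is the number of raters
    who assigned item i to category j. Every item needs the same number of raters.
    Undefined (division by zero) if all ratings fall into a single category."""
    n_items = len(counts)
    n_raters = sum(counts[0])
    if any(sum(row) != n_raters for row in counts):
        raise ValueError("every item must be rated by the same number of raters")
    n_categories = len(counts[0])

    # Overall proportion of assignments that went to each category.
    p_j = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    # Observed agreement per item, then averaged over items.
    p_i = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(p_i) / n_items
    # Agreement expected by chance.
    p_e = sum(p * p for p in p_j)
    return (p_bar - p_e) / (1 - p_e)


# Hypothetical table: 5 utterances, 4 raters, categories (child, adult, elderly).
age_counts = [
    [4, 0, 0],
    [0, 4, 0],
    [0, 0, 4],
    [3, 1, 0],
    [0, 2, 2],
]
kappa = fleiss_kappa(age_counts)
print(f"Fleiss' kappa: {kappa:.3f}")
```

Fleiss' kappa treats categories as nominal. For ordinal scales such as graded emotion intensity, an intraclass correlation (the ICC the referee also mentions) may be the more suitable choice.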
Circularity Check
Empirical evaluation study with no mathematical derivations or self-referential loops
The paper conducts a perceptual analysis of ITTS systems by collecting human ratings in the E-VOC corpus on attributes such as speaker age, word-level emphasis, adverbs of degree, and graded emotion intensity. All central claims (e.g., gpt-4o-mini-tts showing best alignment) are direct observational results from these ratings and comparisons across five systems. No equations, fitted parameters, predictions derived from models, or self-citations are used to derive the findings; the work is self-contained data collection and reporting without any reduction of outputs to inputs by construction.
Axiom & Free-Parameter Ledger
Forward citations
Cited by 2 Pith papers
- Toward Fair Speech Technologies: A Comprehensive Survey of Bias and Fairness in Speech AI
  The paper delivers a unified framework for fairness in speech technologies by formalizing seven definitions, organizing research into three paradigms, diagnosing pipeline-specific biases, and mapping mitigations to th...
- CapTalk: Unified Voice Design for Single-Utterance and Dialogue Speech Generation
  CapTalk unifies single-utterance and dialogue voice design via utterance- and speaker-level captions plus a hierarchical variational module for stable timbre with adaptive expression.
Reference graph
Works this paper leans on
- [1] INTRODUCTION. Instruction-guided text-to-speech (ITTS) [1,2] enables users to steer speech synthesis using natural-language prompts (e.g., “read this joyfully” or “speak like a child”). This approach offers a transparent and flexible alternative to conventional TTS pipelines [3, 4] that often require low-level acoustic controls or specialized labels for ...
- [2] RELATED WORKS AND BACKGROUND. 2.1. ITTS Systems and Selection. The field of Instruction-guided Text-to-Speech (ITTS) has seen rapid advancement, with many models capable of generating speech from descriptive prompts [1]. Although robust systems such as Audiobox
- [3] exist, their closed-source nature limits transparency and reproducibility, which are essential for this study. Therefore, to ensure a comprehensive and replicable analysis, we selected five representative models across three distinct categories. First, to represent the state-of-the-art in open-source research, we included Parler-TTS
- [4] and PromptTTS++ [11]. These models are publicly available, allowing for the in-depth analysis required for this study. Second, to represent the leading edge of commercial ITTS systems, we incorporate GPT-4o-mini-TTS [12]. Its efficient and high-quality API provides an insightful analysis for production-grade expressive synthesis. Finally, to test the capa...
- [5] EVALUATION FRAMEWORK. We designed a comprehensive evaluation framework to investigate the instruction-perception gap in ITTS systematically. This framework consists of 3 core components: the control dimensions that define the evaluation tasks (Section 3.1), the evaluation metrics used to quantify alignment (Section 3.2), and the E-VOC corpus of human p...
- [6] EXPERIMENTAL RESULTS AND ANALYSES. 4.1. Adverbs of Degree. As shown in Figure 1, gpt-4o provides the clearest and most consistent mapping from degree adverbs to acoustic features. Figure 2 (top row) extends this analysis to perceived emotion intensity under adverb cues. Loudness. gpt-4o spans a wide LUFS range with a predictable ordering from “slightly” ...
- [7] CONCLUSION AND FUTURE WORK. Conclusion. This work addresses the largely unexplored link between natural-language instructions and listener perception in ITTS. We proposed a novel framework for evaluating fine-grained control using adverbs of degree (e.g., “slightly happy”) and ordered emotional adjectives (e.g., from “Content” to “Happy” to “Ecstatic”)....
- [8] Dongchao Yang, Songxiang Liu, Rongjie Huang, Chao Weng, and Helen Meng, “InstructTTS: Modelling expressive TTS in discrete latent space with natural language style prompt,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 32, pp. 2913–2925, 2024
- [9] Zhihao Du et al., “CosyVoice 3: Towards in-the-wild speech generation via scaling-up and post-training,” 2025
- [10] Yi Ren, Chenxu Hu, Xu Tan, Tao Qin, Sheng Zhao, Zhou Zhao, and Tie-Yan Liu, “FastSpeech 2: Fast and High-Quality End-to-End Text to Speech,” in International Conference on Learning Representations, 2021
- [11] Edresson Casanova, Julian Weber, Christopher D Shulby, Arnaldo Candido Junior, Eren Gölge, and Moacir A Ponti, “YourTTS: Towards Zero-Shot Multi-Speaker TTS and Zero-Shot Voice Conversion for Everyone,” in Proceedings of the 39th International Conference on Machine Learning, Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szepesvari, Gang Niu,...
- [12] Chen-Chou Lo, Szu-Wei Fu, Wen-Chin Huang, Xin Wang, Junichi Yamagishi, Yu Tsao, and Hsin-Min Wang, “MOSNet: Deep Learning-Based Objective Assessment for Voice Conversion,” in Interspeech 2019, 2019, pp. 1541–1545
- [13] Gabriel Mittag et al., “NISQA: A Deep CNN-Self-Attention Model for Multidimensional Speech Quality Prediction with Crowdsourced Datasets,” in Interspeech 2021, 2021
- [14] Wenze Ren, Yi-Cheng Lin, Wen-Chin Huang, Ryandhimas E. Zezario, Szu-Wei Fu, Sung-Feng Huang, Erica Cooper, Haibin Wu, Hung-Yu Wei, Hsin-Min Wang, Hung-yi Lee, and Yu Tsao, “HighRateMOS: Sampling-Rate Aware Modeling for Speech Quality Assessment,” 2025
- [15] Kenichi Arai, Shoko Araki, Atsunori Ogawa, Keisuke Kinoshita, Tomohiro Nakatani, Katsuhiko Yamamoto, and Toshio Irino, “Predicting speech intelligibility of enhanced speech using phone accuracy of DNN-based ASR system,” in Interspeech 2019, 2019, pp. 4275–4279
- [16] Apoorv Vyas et al., “Audiobox: Unified Audio Generation with Natural Language Prompts,” 2023
- [17] Dan Lyth and Simon King, “Natural language guidance of high-fidelity text-to-speech with synthetic annotations,” 2024
- [18] Reo Shimizu et al., “PromptTTS++: Controlling Speaker Identity in Prompt-Based Text-To-Speech Using Natural Language Descriptions,” in ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2024, pp. 12672–12676
- [19] OpenAI, “Introducing next-generation audio models in the API,” March 2025
- [20] Dongchao Yang et al., “UniAudio: An Audio Foundation Model Toward Universal Audio Generation,” 2024
- [21] Deok-Hyeon Cho, Hyung-Seok Oh, Seung-Bin Kim, and Seong-Whan Lee, “EmoSphere++: Emotion-Controllable Zero-Shot Text-to-Speech Via Emotion-Adaptive Spherical Vector,” IEEE Transactions on Affective Computing, vol. 16, no. 3, pp. 2365–2380, 2025
- [22] Ziyang Ma, Zhisheng Zheng, Jiaxin Ye, Jinchao Li, Zhifu Gao, ShiLiang Zhang, and Xie Chen, “emotion2vec: Self-Supervised Pre-Training for Speech Emotion Representation,” in Findings of the Association for Computational Linguistics: ACL 2024, Lun-Wei Ku, Andre Martins, and Vivek Srikumar, Eds., Bangkok, Thailand, Aug. 2024, pp. 15747–15760, Association f...
- [23] Yixuan Zhou, Xiaoyu Qin, Zeyu Jin, Shuoyi Zhou, Shun Lei, Songtao Zhou, Zhiyong Wu, and Jia Jia, “VoxInstruct: Expressive Human Instruction-to-Speech Generation with Unified Multilingual Codec Language Modelling,” in Proceedings of the 32nd ACM International Conference on Multimedia, New York, NY, USA, 2024, MM ’24, p. 554–563, Association for ...
- [24] Guanrou Yang, Chen Yang, Qian Chen, Ziyang Ma, Wenxi Chen, Wen Wang, Tianrui Wang, Yifan Yang, Zhikang Niu, Wenrui Liu, Fan Yu, Zhihao Du, Zhifu Gao, ShiLiang Zhang, and Xie Chen, “EmoVoice: LLM-based Emotional Text-To-Speech Model with Freestyle Text Prompting,” 2025
- [25] Zeyu Jin, Jia Jia, Qixin Wang, Kehan Li, Shuoyi Zhou, Songtao Zhou, Xiaoyu Qin, and Zhiyong Wu, “SpeechCraft: A Fine-Grained Expressive Speech Dataset with Natural Language Description,” in Proceedings of the 32nd ACM International Conference on Multimedia, New York, NY, USA, 2024, MM ’24, p. 1255–1264, Association for Computing Machinery
- [26] Kexin Huang, Qian Tu, Liwei Fan, Chenchen Yang, Dong Zhang, et al., “InstructTTSEval: Benchmarking Complex Natural-Language Instruction Following in Text-to-Speech Systems,” 2025
- [27] Gheorghe Comanici et al., “Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities,” 2025
- [28] Saif Mohammad, “Word Affect Intensities,” in Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan, May 2018, European Language Resources Association (ELRA)
- [29] The Wikipedia contributors, “English Wikipedia database dump,” Available: https://dumps.wikimedia.org/enwiki/20230413/, 2023, Accessed: Apr. 3, 2025
- [30] Jong Wook Kim, Justin Salamon, Peter Li, and Juan Pablo Bello, “Crepe: A Convolutional Representation for Pitch Estimation,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018, pp. 161–165
- [31] Yoach Lacombe, Vaibhav Srivastav, and Sanchit Gandhi, “Parler-TTS,” https://github.com/huggingface/parler-tts, 2024
- [32] OpenAI, “GPT-4o mini TTS,” Text-to-Speech Model Documentation, 2025, Version accessed on March 4, 2025
- [33] Houwei Cao, David G. Cooper, Michael K. Keutmann, Ruben C. Gur, Ani Nenkova, and Ragini Verma, “CREMA-D: Crowd-Sourced Emotional Multimodal Actors Dataset,” IEEE Transactions on Affective Computing, vol. 5, no. 4, pp. 377–390, 2014
- [34] NEXDATA AI, “50.5 Hours — English (America) Children Scripted Monologue Microphone Speech Dataset,” https://www.nexdata.ai/datasets/speechrecog/75?source=Github, 2025
- [35] Cheng-Han Chiang, Xiaofei Wang, et al., “Audio-Aware Large Language Models as Judges for Speaking Styles,” 2025