
arxiv: 2509.13989 · v6 · submitted 2025-09-17 · 📡 eess.AS

Do You Hear What I Mean? Quantifying the Instruction-Perception Gap in Instruction-Guided Expressive Text-To-Speech Systems

Pith reviewed 2026-05-18 16:34 UTC · model grok-4.3

classification 📡 eess.AS
keywords instruction-guided TTS · expressive speech synthesis · perception gap · human evaluation · voice control · controllability · age bias

The pith

Instruction-guided TTS systems show a clear gap between user prompts and how listeners perceive the generated speech.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper measures how well natural language instructions control expressive attributes of synthesized speech: adverbs of degree, graded emotion intensity, speaker age, and word-level emphasis. It introduces the E-VOC corpus of large-scale human listener ratings across five ITTS models to expose the mismatch. The findings indicate that gpt-4o-mini-tts aligns with instructions better than the other systems, while most systems default to adult voices and struggle to reflect small differences between instructions. This matters because it shows that the intuitive control promised by ITTS is not yet reliable for users.

Core claim

The authors establish through human evaluations that a substantial instruction-perception gap exists in ITTS, with gpt-4o-mini-tts displaying the strongest alignment across acoustic dimensions, all five systems tending to generate adult voices even when instructed to produce child or elderly ones, and fine-grained control over slightly different attributes remaining difficult.

What carries the argument

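The E-VOC corpus of human listener ratings on generated utterances, covering two expressive dimensions (adverbs of degree and graded emotion intensity) together with speaker age and word-level emphasis, used to quantify how closely perception follows instruction.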

If this is right

  • Users will frequently receive speech outputs that do not match their intended style instructions.
  • Current ITTS models have limited ability to interpret and apply small differences in attribute instructions.
  • Voice age control is especially unreliable and tends to default to adult characteristics across systems.
  • The E-VOC corpus can serve as a benchmark for measuring progress in instruction following for future models.
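One way to make the age-control consequence concrete is to tabulate instructed age against perceived age. The sketch below is illustrative only: the label set, the function names, and the toy data are assumptions made here, not the paper's protocol or results. It computes how often each instructed age is perceived as intended and how often non-adult instructions are heard as adult.

```python
from collections import Counter

# Hypothetical label set; the paper's actual rating categories may differ.
AGE_LABELS = ("child", "adult", "elderly")


def age_confusion(pairs):
    """Count (instructed, perceived) label pairs."""
    return Counter(pairs)


def match_rate(pairs, label):
    """Fraction of utterances instructed as `label` that were perceived as `label`."""
    perceived = [p for instructed, p in pairs if instructed == label]
    if not perceived:
        return None
    return sum(p == label for p in perceived) / len(perceived)


def adult_default_rate(pairs):
    """Fraction of non-adult instructions that listeners perceived as adult."""
    non_adult = [p for instructed, p in pairs if instructed != "adult"]
    if not non_adult:
        return None
    return sum(p == "adult" for p in non_adult) / len(non_adult)


# Made-up (instructed, perceived) pairs, for illustration only.
toy_pairs = [
    ("child", "adult"), ("child", "child"), ("child", "adult"),
    ("adult", "adult"), ("adult", "adult"),
    ("elderly", "adult"), ("elderly", "elderly"), ("elderly", "adult"),
]

if __name__ == "__main__":
    for label in AGE_LABELS:
        print(label, match_rate(toy_pairs, label))
    print("adult default rate:", adult_default_rate(toy_pairs))
```

A high adult default rate alongside low child and elderly match rates would be the pattern the summary describes; the toy numbers here carry no evidential weight.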

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Training datasets for these models likely under-represent child and elderly voices in response to explicit instructions.
  • Similar perception gaps may appear in other instruction-guided generative systems such as image or music generation.
  • Developers could create automatic predictors of the gap by correlating the human ratings with objective acoustic measures.
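The last extension, correlating human ratings with objective acoustic measures, could start from a rank correlation. The following is a minimal sketch under stated assumptions: it implements Spearman's rank correlation in plain Python (average ranks for ties), and the input vectors are hypothetical per-utterance values, not data from E-VOC.

```python
def average_ranks(values):
    """Return 1-based ranks, giving tied values the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the run while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            ranks[order[k]] = shared
        pos = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    return pearson(average_ranks(x), average_ranks(y))


if __name__ == "__main__":
    # Hypothetical values: mean listener rating vs. an acoustic measure per utterance.
    human_rating = [1.8, 2.4, 3.1, 3.0, 4.6]
    acoustic_measure = [-27.0, -25.5, -22.0, -23.1, -18.4]
    print(spearman(human_rating, acoustic_measure))
```

A rank correlation is used here because listener ratings are ordinal; whether any acoustic measure actually predicts the gap is an open question that the sketch does not answer.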

Load-bearing premise

Human listener ratings in the E-VOC corpus accurately capture the true instruction-perception gap without significant bias from rating scale interpretation or listener demographics.

What would settle it

Repeating the listening tests on the same generated utterances with listeners from substantially different demographic groups and obtaining markedly different ratings for age or emphasis attributes would undermine the gap measurements.
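A replication of this kind needs some criterion for "markedly different". One option, offered here as an assumption rather than anything the paper specifies, is a two-sided permutation test on the difference in mean ratings between two listener groups. The function name, the default number of permutations, and the toy ratings below are all hypothetical.

```python
import random


def mean(values):
    return sum(values) / len(values)


def permutation_test(group_a, group_b, n_permutations=2000, seed=0):
    """Two-sided permutation test on the absolute difference in group means.

    Returns (observed_difference, p_value). The p-value uses the common
    (count + 1) / (n + 1) correction so it is never exactly zero.
    """
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_permutations):
        # Reassign listeners to groups at random and recompute the difference.
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            extreme += 1
    return observed, (extreme + 1) / (n_permutations + 1)


if __name__ == "__main__":
    # Made-up ratings of the same utterances from two listener groups.
    original_listeners = [1, 1, 2, 1, 2, 1]
    new_listeners = [5, 4, 5, 5, 4, 5]
    print(permutation_test(original_listeners, new_listeners))
```

A small p-value on real replication data would suggest the ratings depend on who is listening, which is the outcome that would weaken the gap measurements.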

Original abstract

Instruction-guided text-to-speech (ITTS) enables users to control speech generation through natural language prompts, offering a more intuitive interface than traditional TTS. However, the alignment between user style instructions and listener perception remains largely unexplored. This work first presents a perceptual analysis of ITTS controllability across two expressive dimensions (adverbs of degree and graded emotion intensity) and collects human ratings on speaker age and word-level emphasis attributes. To comprehensively reveal the instruction-perception gap, we provide a data collection with large-scale human evaluations, named Expressive VOice Control (E-VOC) corpus. Furthermore, we reveal that (1) gpt-4o-mini-tts is the most reliable ITTS model with great alignment between instruction and generated utterances across acoustic dimensions. (2) The 5 analyzed ITTS systems tend to generate Adult voices even when the instructions ask to use child or Elderly voices. (3) Fine-grained control remains a major challenge, indicating that most ITTS systems have substantial room for improvement in interpreting slightly different attribute instructions.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The manuscript examines the instruction-perception gap in instruction-guided expressive text-to-speech (ITTS) systems across five models. It introduces the Expressive VOice Control (E-VOC) corpus collected via large-scale human evaluations on expressive dimensions including adverbs of degree, graded emotion intensity, speaker age, and word-level emphasis. The central claims are that gpt-4o-mini-tts exhibits the strongest alignment between instructions and listener perceptions, that the tested systems tend to default to adult voices despite contrary instructions, and that fine-grained control over slightly differing attributes remains challenging for current ITTS systems.

Significance. If the human ratings prove reliable, the work supplies a new evaluation corpus and concrete evidence of controllability limitations in ITTS, which could serve as a benchmark for improving intuitive expressive control in speech synthesis applications.

major comments (1)
  1. [E-VOC corpus description and human evaluation protocol] The headline conclusion that gpt-4o-mini-tts is the most reliable ITTS model with best instruction-perception alignment, along with the adult-voice bias finding, rests entirely on listener ratings from the E-VOC corpus. The manuscript provides no report of inter-rater agreement (Fleiss' kappa, ICC, or equivalent), rater screening criteria, demographic balancing, or tests confirming consistent scale interpretation across the two expressive dimensions and the age/emphasis attributes. This methodological gap directly undermines confidence in the system rankings and the three numbered findings.
minor comments (1)
  1. [Abstract] The abstract states that ratings were collected on 'speaker age and word-level emphasis attributes' but does not specify the exact rating scale (5-point, continuous, etc.) or the number of utterances and raters per condition; adding these numbers would improve readability.
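For readers unfamiliar with the agreement statistic named in the major comment, the sketch below computes Fleiss' kappa from a table of category counts. It is a generic textbook implementation in plain Python, not the authors' analysis; the function name and the toy table are assumptions, and it requires every item to be rated by the same number of raters.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for a table where counts[i][j] is the number of raters
    who assigned item i to category j. Every row must sum to the same n >= 2.
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    if n_raters < 2:
        raise ValueError("need at least two raters per item")
    if any(sum(row) != n_raters for row in counts):
        raise ValueError("every item must have the same number of ratings")
    n_categories = len(counts[0])

    # Overall proportion of assignments falling in each category.
    p_cat = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    # Per-item agreement: share of rater pairs that agree on the item.
    p_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_observed = sum(p_item) / n_items
    p_expected = sum(p * p for p in p_cat)
    if p_expected == 1:
        raise ValueError("kappa is undefined when only one category is used")
    return (p_observed - p_expected) / (1 - p_expected)


if __name__ == "__main__":
    # Made-up table: 4 utterances, 3 raters each, categories (child, adult, elderly).
    toy_counts = [
        [3, 0, 0],
        [0, 3, 0],
        [0, 2, 1],
        [1, 1, 1],
    ]
    print(fleiss_kappa(toy_counts))
```

Fleiss' kappa treats categories as unordered, so for ordinal scales such as emotion intensity an ICC or a weighted statistic may be the better fit; the referee's "or equivalent" leaves that choice open.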

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their thorough review and valuable comments on our manuscript. We have carefully considered the concerns regarding the E-VOC corpus and human evaluation protocol. Below, we provide a point-by-point response and outline the revisions we will make to address these issues.

Point-by-point responses
  1. Referee: The headline conclusion that gpt-4o-mini-tts is the most reliable ITTS model with best instruction-perception alignment, along with the adult-voice bias finding, rests entirely on listener ratings from the E-VOC corpus. The manuscript provides no report of inter-rater agreement (Fleiss' kappa, ICC, or equivalent), rater screening criteria, demographic balancing, or tests confirming consistent scale interpretation across the two expressive dimensions and the age/emphasis attributes. This methodological gap directly undermines confidence in the system rankings and the three numbered findings.

    Authors: We agree that providing details on inter-rater agreement and rater demographics is crucial for validating the human evaluation results. Although the current manuscript focuses on the corpus introduction and findings, we did collect the necessary data during the large-scale evaluation. In the revised version, we will include: (1) inter-rater agreement metrics such as Fleiss' kappa for the ratings on adverbs of degree, emotion intensity, age, and emphasis; (2) rater screening criteria, including requirements for English proficiency and headphone use; (3) available demographic information; and (4) any additional checks for consistent scale use across attributes. These additions will bolster confidence in the reported results, including the superior performance of gpt-4o-mini-tts and the identified biases and challenges. We do not anticipate that these details will change our main conclusions, but they will enhance the manuscript's rigor.

    Revision: yes

Circularity Check

0 steps flagged

Empirical evaluation study with no mathematical derivations or self-referential loops


The paper conducts a perceptual analysis of ITTS systems by collecting human ratings in the E-VOC corpus on attributes such as speaker age, word-level emphasis, adverbs of degree, and graded emotion intensity. All central claims (e.g., gpt-4o-mini-tts showing best alignment) are direct observational results from these ratings and comparisons across five systems. No equations, fitted parameters, predictions derived from models, or self-citations are used to derive the findings; the work is self-contained data collection and reporting without any reduction of outputs to inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Empirical perceptual study; no free parameters, mathematical axioms, or invented entities are introduced.



Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Toward Fair Speech Technologies: A Comprehensive Survey of Bias and Fairness in Speech AI

    eess.AS 2026-05 accept novelty 7.0

    The paper delivers a unified framework for fairness in speech technologies by formalizing seven definitions, organizing research into three paradigms, diagnosing pipeline-specific biases, and mapping mitigations to th...

  2. CapTalk: Unified Voice Design for Single-Utterance and Dialogue Speech Generation

    cs.SD 2026-04 unverdicted novelty 7.0

    CapTalk unifies single-utterance and dialogue voice design via utterance- and speaker-level captions plus a hierarchical variational module for stable timbre with adaptive expression.
