Do You Hear What I Mean? Quantifying the Instruction-Perception Gap in Instruction-Guided Expressive Text-To-Speech Systems

Huang-Cheng Chou; Hung-yi Lee; Kuan-Yu Chen; Tzu-Chieh Wei; Yi-Cheng Lin

arxiv: 2509.13989 · v6 · submitted 2025-09-17 · 📡 eess.AS

Yi-Cheng Lin, Huang-Cheng Chou, Tzu-Chieh Wei, Kuan-Yu Chen, Hung-yi Lee

Pith reviewed 2026-05-18 16:34 UTC · model grok-4.3

classification 📡 eess.AS

keywords instruction-guided TTS · expressive speech synthesis · perception gap · human evaluation · voice control · controllability · age bias

The pith

Instruction-guided TTS systems show a clear gap between user prompts and how listeners perceive the generated speech.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper measures how well natural-language instructions control expressive features of synthesized speech: adverbs of degree, emotion intensity, speaker age, and word-level emphasis. It introduces the E-VOC corpus of large-scale human listener ratings across five ITTS models to expose the mismatch. The findings indicate that gpt-4o-mini-tts aligns best with instructions, while the systems tend to default to adult voices and struggle with small changes in instructions. This matters because it shows that the intuitive control promised by ITTS is not yet reliable for users.

Core claim

The authors establish through human evaluations that a substantial instruction-perception gap exists in ITTS, with gpt-4o-mini-tts displaying the strongest alignment across acoustic dimensions, all five systems tending to generate adult voices even when instructed to produce child or elderly ones, and fine-grained control over slightly different attributes remaining difficult.

What carries the argument

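The E-VOC corpus: large-scale human ratings of generated utterances, used to quantify instruction-perception alignment along two expressive dimensions (adverbs of degree and graded emotion intensity) and for the speaker-age and word-level-emphasis attributes.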

If this is right

Users will frequently receive speech outputs that do not match their intended style instructions.
Current ITTS models have limited ability to interpret and apply small differences in attribute instructions.
Voice age control is especially unreliable and tends to default to adult characteristics across systems.
The E-VOC corpus can serve as a benchmark for measuring progress in instruction following for future models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Training datasets for these models likely under-represent child and elderly voices in response to explicit instructions.
Similar perception gaps may appear in other instruction-guided generative systems such as image or music generation.
Developers could create automatic predictors of the gap by correlating the human ratings with objective acoustic measures.
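The last extension above could be prototyped as a rank correlation between mean listener ratings and an objective acoustic measure, computed per instruction level. The following is a minimal sketch under stated assumptions: the function names are hypothetical, Spearman correlation is one reasonable choice rather than anything the paper prescribes, and the numbers in the demo are made-up toy values, not data from E-VOC.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(vx * vy)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    return pearson(average_ranks(x), average_ranks(y))


if __name__ == "__main__":
    # Hypothetical toy values for four instruction levels (NOT from the paper):
    # measured loudness in LUFS and mean perceived-intensity rating.
    loudness_lufs = [-23.0, -20.5, -18.0, -16.5]
    mean_rating = [1.8, 2.4, 3.1, 3.9]
    print("Spearman rho:", spearman(loudness_lufs, mean_rating))
```

A high rank correlation would only suggest that the acoustic measure tracks perception for that attribute; it would not by itself validate an automatic predictor.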

Load-bearing premise

Human listener ratings in the E-VOC corpus accurately capture the true instruction-perception gap without significant bias from rating scale interpretation or listener demographics.

What would settle it

Repeating the listening tests on the same generated utterances with listeners from substantially different demographic groups and obtaining markedly different ratings for age or emphasis attributes would undermine the gap measurements.
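To make "markedly different ratings" concrete, one option is a two-sided permutation test on the difference in mean ratings between two listener groups for the same utterances. The sketch below is an assumption-laden illustration: the function name is hypothetical, the exact enumeration is only practical for small groups, and nothing here reflects an analysis performed in the paper.

```python
from itertools import combinations


def mean(values):
    return sum(values) / len(values)


def exact_permutation_pvalue(group_a, group_b):
    """Two-sided exact permutation test on the difference of group means.

    Enumerates every way of splitting the pooled ratings into groups of the
    original sizes and returns the fraction of splits whose absolute mean
    difference is at least as large as the observed one.
    """
    if not group_a or not group_b:
        raise ValueError("both groups must be non-empty")
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    n_b = len(pooled) - n_a
    observed = abs(mean(group_a) - mean(group_b))
    total = sum(pooled)

    extreme = 0
    n_splits = 0
    for idx in combinations(range(len(pooled)), n_a):
        sum_a = sum(pooled[i] for i in idx)
        diff = abs(sum_a / n_a - (total - sum_a) / n_b)
        # Small tolerance so floating-point noise does not drop exact ties.
        if diff >= observed - 1e-12:
            extreme += 1
        n_splits += 1
    return extreme / n_splits


if __name__ == "__main__":
    # Hypothetical ratings of one utterance set by two listener groups.
    group_1 = [2, 3, 3, 4, 2]
    group_2 = [4, 4, 5, 3, 5]
    print("p-value:", exact_permutation_pvalue(group_1, group_2))
```

For realistic group sizes the number of splits grows too quickly to enumerate, so a Monte Carlo variant that samples random splits would be the usual substitute.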

Original abstract

Instruction-guided text-to-speech (ITTS) enables users to control speech generation through natural language prompts, offering a more intuitive interface than traditional TTS. However, the alignment between user style instructions and listener perception remains largely unexplored. This work first presents a perceptual analysis of ITTS controllability across two expressive dimensions (adverbs of degree and graded emotion intensity) and collects human ratings on speaker age and word-level emphasis attributes. To comprehensively reveal the instruction-perception gap, we provide a data collection with large-scale human evaluations, named Expressive VOice Control (E-VOC) corpus. Furthermore, we reveal that (1) gpt-4o-mini-tts is the most reliable ITTS model with great alignment between instruction and generated utterances across acoustic dimensions. (2) The 5 analyzed ITTS systems tend to generate Adult voices even when the instructions ask to use child or Elderly voices. (3) Fine-grained control remains a major challenge, indicating that most ITTS systems have substantial room for improvement in interpreting slightly different attribute instructions.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

The paper introduces the E-VOC corpus and shows that gpt-4o-mini-tts aligns best with instructions but all five ITTS systems default to adult voices and struggle with fine-grained attributes.

The main thing to know is that this work measures the gap between style instructions and what listeners actually perceive in expressive TTS, using a new set of human ratings on adverbs of degree, emotion intensity, age, and emphasis. The authors compare five systems and conclude that gpt-4o-mini-tts comes closest overall, while all five show a clear bias toward adult voices even when instructed otherwise. Fine-grained control stays difficult across the board. That is the core contribution.

The E-VOC corpus itself looks like the freshest part, since it collects targeted perceptual data on graded attributes rather than just overall naturalness or MOS scores. The large-scale listening tests give a concrete picture of where current models fall short for applications that need precise control. The comparison across multiple ITTS systems is useful because it moves past single-model case studies.

The soft spot is the reliance on those listener ratings without reported inter-rater agreement, rater demographics, or checks on scale interpretation. If listeners default to certain voice assumptions or compress distinctions on the rating scales, the ranking of the models and the claim about gpt-4o-mini-tts reliability could shift. The abstract does not spell out the protocol, so that needs tightening before the findings can be taken as settled.

This paper is aimed at people building or evaluating controllable TTS for voice assistants and accessibility tools. Readers who care about practical controllability limits will find the corpus and the attribute-specific results worth looking at. It is solid enough on the empirical side to deserve a serious referee, mainly because the question is timely and the data collection is new. I would send it out for review with a request for the missing details on how the ratings were collected and validated.

Referee Report

1 major / 1 minor

Summary. The manuscript examines the instruction-perception gap in instruction-guided expressive text-to-speech (ITTS) systems across five models. It introduces the Expressive VOice Control (E-VOC) corpus collected via large-scale human evaluations on expressive dimensions including adverbs of degree, graded emotion intensity, speaker age, and word-level emphasis. The central claims are that gpt-4o-mini-tts exhibits the strongest alignment between instructions and listener perceptions, that the tested systems tend to default to adult voices despite contrary instructions, and that fine-grained control over slightly differing attributes remains challenging for current ITTS systems.

Significance. If the human ratings prove reliable, the work supplies a new evaluation corpus and concrete evidence of controllability limitations in ITTS, which could serve as a benchmark for improving intuitive expressive control in speech synthesis applications.

major comments (1)

[E-VOC corpus description and human evaluation protocol] The headline conclusion that gpt-4o-mini-tts is the most reliable ITTS model with best instruction-perception alignment, along with the adult-voice bias finding, rests entirely on listener ratings from the E-VOC corpus. The manuscript provides no report of inter-rater agreement (Fleiss' kappa, ICC, or equivalent), rater screening criteria, demographic balancing, or tests confirming consistent scale interpretation across the two expressive dimensions and the age/emphasis attributes. This methodological gap directly undermines confidence in the system rankings and the three numbered findings.
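For readers unfamiliar with the agreement statistic the referee asks for, the sketch below computes Fleiss' kappa from a table of per-item category counts. It is an illustrative assumption, not the authors' analysis: the function name is hypothetical, and it assumes every item is rated by the same number of raters. Fleiss' kappa treats categories as nominal, so for ordinal rating scales a measure such as ICC may be the better fit.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for a table where counts[i][j] is the number of raters
    who assigned item i to category j.

    Assumes every item was rated by the same number of raters (>= 2).
    Returns NaN when all ratings fall into a single category, because
    chance agreement is then 1 and the statistic is undefined.
    """
    n_items = len(counts)
    if n_items == 0:
        raise ValueError("need at least one item")
    n_raters = sum(counts[0])
    if n_raters < 2:
        raise ValueError("need at least two raters per item")
    if any(sum(row) != n_raters for row in counts):
        raise ValueError("every item must have the same number of ratings")
    n_categories = len(counts[0])

    # Per-item observed agreement: share of rater pairs that agree.
    p_items = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(p_items) / n_items

    # Chance agreement from the overall category proportions.
    totals = [sum(row[j] for row in counts) for j in range(n_categories)]
    p_e = sum((t / (n_items * n_raters)) ** 2 for t in totals)

    if p_e == 1.0:
        return float("nan")
    return (p_bar - p_e) / (1 - p_e)


if __name__ == "__main__":
    # Hypothetical counts: 4 utterances, 5 raters, categories
    # (child, adult, elderly). Not data from E-VOC.
    table = [
        [0, 5, 0],
        [1, 4, 0],
        [0, 3, 2],
        [2, 3, 0],
    ]
    print("Fleiss' kappa:", fleiss_kappa(table))
```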

minor comments (1)

[Abstract] The abstract states that ratings were collected on 'speaker age and word-level emphasis attributes' but does not specify the exact rating scale (5-point, continuous, etc.) or the number of utterances and raters per condition; adding these numbers would improve readability.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their thorough review and valuable comments on our manuscript. We have carefully considered the concerns regarding the E-VOC corpus and human evaluation protocol. Below, we provide a point-by-point response and outline the revisions we will make to address these issues.

Referee: The headline conclusion that gpt-4o-mini-tts is the most reliable ITTS model with best instruction-perception alignment, along with the adult-voice bias finding, rests entirely on listener ratings from the E-VOC corpus. The manuscript provides no report of inter-rater agreement (Fleiss' kappa, ICC, or equivalent), rater screening criteria, demographic balancing, or tests confirming consistent scale interpretation across the two expressive dimensions and the age/emphasis attributes. This methodological gap directly undermines confidence in the system rankings and the three numbered findings.

Authors: We agree that providing details on inter-rater agreement and rater demographics is crucial for validating the human evaluation results. Although the current manuscript focuses on the corpus introduction and findings, we did collect the necessary data during the large-scale evaluation. In the revised version, we will include: (1) inter-rater agreement metrics such as Fleiss' kappa for the ratings on adverbs of degree, emotion intensity, age, and emphasis; (2) rater screening criteria, including requirements for English proficiency and headphone use; (3) available demographic information; and (4) any additional checks for consistent scale use across attributes. These additions will bolster confidence in the reported results, including the superior performance of gpt-4o-mini-tts and the identified biases and challenges. We do not anticipate that these details will change our main conclusions but will enhance the manuscript's rigor.

Revision: yes

Circularity Check

0 steps flagged

Empirical evaluation study with no mathematical derivations or self-referential loops

The paper conducts a perceptual analysis of ITTS systems by collecting human ratings in the E-VOC corpus on attributes such as speaker age, word-level emphasis, adverbs of degree, and graded emotion intensity. All central claims (e.g., gpt-4o-mini-tts showing best alignment) are direct observational results from these ratings and comparisons across five systems. No equations, fitted parameters, predictions derived from models, or self-citations are used to derive the findings; the work is self-contained data collection and reporting without any reduction of outputs to inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Empirical perceptual study; no free parameters, mathematical axioms, or invented entities are introduced.

pith-pipeline@v0.9.0 · 5739 in / 948 out tokens · 36552 ms · 2026-05-18T16:34:57.863300+00:00 · methodology

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Toward Fair Speech Technologies: A Comprehensive Survey of Bias and Fairness in Speech AI
eess.AS 2026-05 accept novelty 7.0

The paper delivers a unified framework for fairness in speech technologies by formalizing seven definitions, organizing research into three paradigms, diagnosing pipeline-specific biases, and mapping mitigations to th...

CapTalk: Unified Voice Design for Single-Utterance and Dialogue Speech Generation
cs.SD 2026-04 unverdicted novelty 7.0

CapTalk unifies single-utterance and dialogue voice design via utterance- and speaker-level captions plus a hierarchical variational module for stable timbre with adaptive expression.

Reference graph

Works this paper leans on

35 extracted references · 35 canonical work pages · cited by 2 Pith papers · 1 internal anchor

[1]

read this joyfully

INTRODUCTION Instruction-guided text-to-speech (ITTS) [1,2] enables users to steer speech synthesis using natural-language prompts (e.g., “read this joyfully” or “speak like a child”). This approach offers a transparent and flexible alternative to conventional TTS pipelines [3, 4] that often require low-level acoustic controls or specialized labels for ...

work page
[2]

RELATED WORKS AND BACKGROUND 2.1. ITTS Systems and Selection The field of Instruction-guided Text-to-Speech (ITTS) has seen rapid advancement, with many models capable of generating speech from descriptive prompts [1]. Although robust systems such as Audiobox

work page
[3]

Therefore, to ensure a comprehensive and replicable analysis, we selected five representative models across three distinct categories

exist, their closed-source nature limits transparency and reproducibility, which are essential for this study. Therefore, to ensure a comprehensive and replicable analysis, we selected five representative models across three distinct categories. First, to represent the state-of-the-art in open-source research, we included Parler-TTS

work page
[4]

Do You Hear What I Mean? Quantifying the Instruction-Perception Gap in Instruction-Guided Expressive Text-To-Speech Systems

and PromptTTS++ [11]. These models are publicly available, allowing for the in-depth analysis required for this study. Second, to represent the leading edge of commercial ITTS systems, we incorporate GPT-4o-mini-TTS [12]. Its efficient and high-quality API provides an insightful analysis for production-grade expressive synthesis. Finally, to test the capa...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[5]

slightly,

EVALUATION FRAMEWORK We designed a comprehensive evaluation framework to investigate the instruction-perception gap in ITTS systematically. This framework consists of 3 core components: the control dimensions that define the evaluation tasks (Section 3.1), the evaluation metrics used to quantify alignment (Section 3.2), and the E-VOC corpus of human p...

work page arXiv
[6]

slightly

EXPERIMENTAL RESULTS AND ANALYSES 4.1. Adverbs of Degree As shown in Figure 1, gpt-4o provides the clearest and most consistent mapping from degree adverbs to acoustic features. Figure 2 (top row) extends this analysis to perceived emotion intensity under adverb cues. Loudness. gpt-4o spans a wide LUFS range with a predictable ordering from “slightly” ...

work page
[7]

slightly happy

CONCLUSION AND FUTURE WORK Conclusion. This work addresses the largely unexplored link between natural-language instructions and listener perception in ITTS. We proposed a novel framework for evaluating fine-grained control using adverbs of degree (e.g., “slightly happy”) and ordered emotional adjectives (e.g., from “Content” to “Happy” to “Ecstatic”)....

work page
[8]

Instructtts: Modelling expressive tts in discrete latent space with natural language style prompt,

Dongchao Yang, Songxiang Liu, Rongjie Huang, Chao Weng, and Helen Meng, “Instructtts: Modelling expressive tts in discrete latent space with natural language style prompt,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 32, pp. 2913–2925, 2024

work page 2024
[9]

Cosyvoice 3: Towards in-the-wild speech generation via scaling-up and post-training,

Zhihao Du et al., “Cosyvoice 3: Towards in-the-wild speech generation via scaling-up and post-training,” 2025

work page 2025
[10]

FastSpeech 2: Fast and High-Quality End-to-End Text to Speech,

Yi Ren, Chenxu Hu, Xu Tan, Tao Qin, Sheng Zhao, Zhou Zhao, and Tie-Yan Liu, “FastSpeech 2: Fast and High-Quality End-to-End Text to Speech,” in International Conference on Learning Representations, 2021

work page 2021
[11]

YourTTS: Towards Zero-Shot Multi-Speaker TTS and Zero-Shot Voice Conversion for Everyone,

Edresson Casanova, Julian Weber, Christopher D Shulby, Arnaldo Candido Junior, Eren Gölge, and Moacir A Ponti, “YourTTS: Towards Zero-Shot Multi-Speaker TTS and Zero-Shot Voice Conversion for Everyone,” in Proceedings of the 39th International Conference on Machine Learning, Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szepesvari, Gang Niu,...

work page 2022
[12]

MOSNet: Deep Learning-Based Objective Assessment for Voice Conversion,

Chen-Chou Lo, Szu-Wei Fu, Wen-Chin Huang, Xin Wang, Junichi Yamagishi, Yu Tsao, and Hsin-Min Wang, “MOSNet: Deep Learning-Based Objective Assessment for Voice Conversion,” in Interspeech 2019, 2019, pp. 1541–1545

work page 2019
[13]

NISQA: A Deep CNN-Self-Attention Model for Multidimensional Speech Quality Prediction with Crowdsourced Datasets,

Gabriel Mittag et al., “NISQA: A Deep CNN-Self-Attention Model for Multidimensional Speech Quality Prediction with Crowdsourced Datasets,” in Interspeech 2021, 2021

work page 2021
[14]

HighRateMOS: Sampling-Rate Aware Modeling for Speech Quality Assessment,

Wenze Ren, Yi-Cheng Lin, Wen-Chin Huang, Ryandhimas E. Zezario, Szu-Wei Fu, Sung-Feng Huang, Erica Cooper, Haibin Wu, Hung-Yu Wei, Hsin-Min Wang, Hung-yi Lee, and Yu Tsao, “HighRateMOS: Sampling-Rate Aware Modeling for Speech Quality Assessment,” 2025

work page 2025
[15]

Predicting speech intelligibility of enhanced speech using phone accuracy of dnn-based asr system,

Kenichi Arai, Shoko Araki, Atsunori Ogawa, Keisuke Kinoshita, Tomohiro Nakatani, Katsuhiko Yamamoto, and Toshio Irino, “Predicting speech intelligibility of enhanced speech using phone accuracy of dnn-based asr system,” in Interspeech 2019, 2019, pp. 4275–4279

work page 2019
[16]

Audiobox: Unified Audio Generation with Natural Language Prompts,

Apoorv Vyas et al., “Audiobox: Unified Audio Generation with Natural Language Prompts,” 2023

work page 2023
[17]

Natural language guidance of high-fidelity text-to-speech with synthetic annotations,

Dan Lyth and Simon King, “Natural language guidance of high-fidelity text-to-speech with synthetic annotations,” 2024

work page 2024
[18]

PromptTTS++: Controlling Speaker Identity in Prompt-Based Text-To-Speech Using Natural Language Descriptions,

Reo Shimizu et al., “PromptTTS++: Controlling Speaker Identity in Prompt-Based Text-To-Speech Using Natural Language Descriptions,” in ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2024, pp. 12672–12676

work page 2024
[19]

Introducing next-generation audio models in the API,

OpenAI, “Introducing next-generation audio models in the API,” March 2025

work page 2025
[20]

UniAudio: An Audio Foundation Model Toward Universal Audio Generation,

Dongchao Yang et al., “UniAudio: An Audio Foundation Model Toward Universal Audio Generation,” 2024

work page 2024
[21]

EmoSphere++: Emotion-Controllable Zero-Shot Text-to-Speech Via Emotion-Adaptive Spherical Vector,

Deok-Hyeon Cho, Hyung-Seok Oh, Seung-Bin Kim, and Seong-Whan Lee, “EmoSphere++: Emotion-Controllable Zero-Shot Text-to-Speech Via Emotion-Adaptive Spherical Vector,” IEEE Transactions on Affective Computing, vol. 16, no. 3, pp. 2365–2380, 2025

work page 2025
[22]

emotion2vec: Self-Supervised Pre-Training for Speech Emotion Representation,

Ziyang Ma, Zhisheng Zheng, Jiaxin Ye, Jinchao Li, Zhifu Gao, ShiLiang Zhang, and Xie Chen, “emotion2vec: Self-Supervised Pre-Training for Speech Emotion Representation,” in Findings of the Association for Computational Linguistics: ACL 2024, Lun-Wei Ku, Andre Martins, and Vivek Srikumar, Eds., Bangkok, Thailand, Aug. 2024, pp. 15747–15760, Association f...

work page 2024
[23]

VoxInstruct: Expressive Human Instruction-to-Speech Generation with Unified Multilingual Codec Language Modelling,

Yixuan Zhou, Xiaoyu Qin, Zeyu Jin, Shuoyi Zhou, Shun Lei, Songtao Zhou, Zhiyong Wu, and Jia Jia, “VoxInstruct: Expressive Human Instruction-to-Speech Generation with Unified Multilingual Codec Language Modelling,” in Proceedings of the 32nd ACM International Conference on Multimedia, New York, NY, USA, 2024, MM ’24, p. 554–563, Association for ...

work page 2024
[24]

EmoVoice: LLM-based Emotional Text-To-Speech Model with Freestyle Text Prompting,

Guanrou Yang, Chen Yang, Qian Chen, Ziyang Ma, Wenxi Chen, Wen Wang, Tianrui Wang, Yifan Yang, Zhikang Niu, Wenrui Liu, Fan Yu, Zhihao Du, Zhifu Gao, ShiLiang Zhang, and Xie Chen, “EmoVoice: LLM-based Emotional Text-To-Speech Model with Freestyle Text Prompting,” 2025

work page 2025
[25]

SpeechCraft: A Fine-Grained Expressive Speech Dataset with Natural Language Description,

Zeyu Jin, Jia Jia, Qixin Wang, Kehan Li, Shuoyi Zhou, Songtao Zhou, Xiaoyu Qin, and Zhiyong Wu, “SpeechCraft: A Fine-Grained Expressive Speech Dataset with Natural Language Description,” in Proceedings of the 32nd ACM International Conference on Multimedia, New York, NY, USA, 2024, MM ’24, p. 1255–1264, Association for Computing Machinery

work page 2024
[26]

InstructTTSEval: Benchmarking Complex Natural-Language Instruction Following in Text-to-Speech Systems,

Kexin Huang, Qian Tu, Liwei Fan, Chenchen Yang, Dong Zhang, et al., “InstructTTSEval: Benchmarking Complex Natural-Language Instruction Following in Text-to-Speech Systems,” 2025

work page 2025
[27]

Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities,

Gheorghe Comanici et al., “Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities,” 2025

work page 2025
[28]

Word Affect Intensities,

Saif Mohammad, “Word Affect Intensities,” in Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan, May 2018, European Language Resources Association (ELRA)

work page 2018
[29]

English Wikipedia database dump,

The Wikipedia contributors, “English Wikipedia database dump,” Available: https://dumps.wikimedia.org/enwiki/20230413/, 2023, Accessed: Apr. 3, 2025

work page arXiv 2023
[30]

Crepe: A Convolutional Representation for Pitch Estimation,

Jong Wook Kim, Justin Salamon, Peter Li, and Juan Pablo Bello, “Crepe: A Convolutional Representation for Pitch Estimation,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018, pp. 161–165

work page 2018
[31]

Parler-TTS,

Yoach Lacombe, Vaibhav Srivastav, and Sanchit Gandhi, “Parler-TTS,” https://github.com/huggingface/parler-tts, 2024

work page 2024
[32]

GPT-4o mini TTS,

OpenAI, “GPT-4o mini TTS,” Text-to-Speech Model Documentation, 2025, Version accessed on March 4, 2025

work page 2025
[33]

CREMA-D: Crowd-Sourced Emotional Multimodal Actors Dataset,

Houwei Cao, David G. Cooper, Michael K. Keutmann, Ruben C. Gur, Ani Nenkova, and Ragini Verma, “CREMA-D: Crowd-Sourced Emotional Multimodal Actors Dataset,” IEEE Transactions on Affective Computing, vol. 5, no. 4, pp. 377–390, 2014

work page 2014
[34]

50.5 Hours — English (America) Children Scripted Monologue Microphone Speech Dataset,

NEXDATA AI, “50.5 Hours — English (America) Children Scripted Monologue Microphone Speech Dataset,” https://www.nexdata.ai/datasets/speechrecog/75?source=Github, 2025

work page 2025
[35]

Audio-Aware Large Language Models as Judges for Speaking Styles,

Cheng-Han Chiang, Xiaofei Wang, et al., “Audio-Aware Large Language Models as Judges for Speaking Styles,” 2025

work page 2025
