DUET: Unified Dual-Space Emotion Control for Diffusion and Flow-Matching Driven Text-to-Speech

Longbing Cao; Xu Zhang; Zhangkai Wu

arxiv: 2606.00066 · v1 · pith:NFRTCUAEnew · submitted 2026-05-20 · 💻 cs.SD · eess.AS

DUET: Unified Dual-Space Emotion Control for Diffusion and Flow-Matching Driven Text-to-Speech

Xu Zhang , Longbing Cao , Zhangkai Wu This is my paper

Pith reviewed 2026-06-30 17:32 UTC · model grok-4.3

classification 💻 cs.SD eess.AS

keywords text-to-speechemotion controldiffusion modelsflow matchingplug-and-playdual-space steeringpretrained modelsspeech synthesis

0 comments

The pith

A plug-and-play framework steers pretrained diffusion and flow-matching TTS models along linearly decodable emotion directions in hidden and mel spaces.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that emotion information appears as a linear direction within the frozen hidden states of TTS models and stays nearly orthogonal to speaker identity. This allows a unified dual-space method called DUET to insert fine-grained emotion control into existing diffusion and flow-matching systems during generation without any retraining. The approach combines one-step hidden-space shifts toward the target emotion with gradient-based refinement of spectral details drawn from a differentiable vocoder. Experiments across five architecturally different backbones and three datasets show it surpasses ten supervised emotional TTS baselines while recording the highest human ratings for emotion appropriateness. The same system also produces expressive speech when deployed on a humanoid robot.

Core claim

We discover that emotion embedding emerges as a linearly decodable direction of frozen hidden states, nearly orthogonal to the direction embedding speaker identity. This inspires DUET, a plug-and-play framework for emotion control over pretrained diffusion and flow-matching based TTS models. During generation, DUET unifies dual-space control to achieve fine-grained emotion intervention in a single per-step update: hidden space steering shifts generation along the target emotion direction, while mel-space guidance refines spectral details through gradients backpropagated from a differentiable vocoder. We validate DUET on five architecturally diverse pretrained TTS backbones across three datas

What carries the argument

Dual-space control that performs hidden-space steering along the linearly decodable emotion direction while applying mel-space guidance via gradients from a differentiable vocoder.

If this is right

Pretrained diffusion and flow-matching TTS models can receive emotion control without retraining or architectural changes.
Fine-grained emotion intervention occurs inside a single per-step update during inference.
The method exceeds the performance of supervised emotional TTS systems trained specifically for the task.
The same plug-and-play control extends to embodied platforms such as humanoid robots.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If comparable linear directions exist for other attributes such as prosody or accent, the same steering approach could add those controls without new training.
Widespread adoption would lower the barrier to building expressive speech systems by removing the need for emotion-specific labeled datasets.
Applying the orthogonality test to additional generative audio models outside TTS would clarify how general the linear-decodability property is.

Load-bearing premise

Emotion information forms a linearly decodable direction inside the frozen hidden states of pretrained TTS models and stays nearly orthogonal to speaker identity.

What would settle it

If steering along the extracted emotion direction produces no measurable rise in human-rated emotion appropriateness on the tested backbones, or if the extracted direction proves linearly entangled with speaker identity in the hidden states.

Figures

Figures reproduced from arXiv: 2606.00066 by Longbing Cao, Xu Zhang, Zhangkai Wu.

**Figure 2.** Figure 2: (a) Colored segments show the accuracy achieved by DUET, the top line marks the GT ceiling of the dataset, and gray segments indicate the gap between them. The noticeably larger gap on angry pinpoints the challenge of capturing its sharp temporal dynamics. (b) Left: scatter panel of hidden states colored by emotion, with speakers distributed uniformly within each cluster, indicating emotion and speaker sub… view at source ↗

**Figure 3.** Figure 3: Plug-and-play deployment of DUET on the Ameca humanoid robot. For each of three emotions, [PITH_FULL_IMAGE:figures/full_fig_p008_3.png] view at source ↗

read the original abstract

Diffusion and flow-matching based text-to-speech (TTS) models excel in naturalness but often lack explicit emotion control, as emotional signals remain entangled with speaker identity. We discover that emotion embedding emerges as a linearly decodable direction of frozen hidden states, nearly orthogonal to the direction embedding speaker identity. This inspires a plug-and-play framework DUET for emotion control over pretrained diffusion and flow-matching based TTS models. During generation, DUET unifies dual-space control to achieve fine-grained emotion intervention in a single per-step update: hidden space steering shifts generation along the target emotion direction, while mel-space guidance refines spectral details through gradients backpropagated from a differentiable vocoder. We validate DUET on five architecturally diverse pretrained TTS backbones across three datasets, where it outperforms 10 supervised state-of-the-art emotional TTS baselines across paradigms and achieves the highest human-rated emotion appropriateness. To further showcase its qualitative behavior, we deploy DUET on an Ameca humanoid robot, where it produces richly expressive emotional speech on the humanoid, demonstrating the strong potential for plug-and-play affective interaction for embodied agents.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

DUET's plug-and-play pitch rests on whether emotion really forms a stable linear direction nearly orthogonal to speaker in frozen states across timesteps.

read the letter

The main takeaway is that DUET adds emotion control to existing diffusion and flow-matching TTS models without retraining by steering along a linear emotion direction in hidden space while adding mel-space guidance from a vocoder, all in one update per step.

What stands out as new is the explicit unification of those two spaces into a single framework that applies to multiple pretrained backbones. The paper tests this on five architecturally different models across three datasets and reports better human-rated emotion appropriateness than ten supervised emotional TTS baselines. The robot deployment on Ameca is a concrete way to show the output in an embodied setting.

The soft spot is the central premise that emotion emerges as a linearly decodable direction nearly orthogonal to speaker identity in the frozen hidden states. The abstract states this observation but the stress-test note correctly flags the lack of reported numbers such as cosine similarities, probe accuracies, or variance explained, plus no check on stability across diffusion timesteps. If the full paper supplies those measurements and they hold up, the method is practical; if they are weak or model-specific, the no-retraining advantage shrinks.

The work is aimed at researchers building controllable TTS or affective agents who already have strong base models. It deserves a serious referee because the experimental scope is wide and the claims are falsifiable with the right probes and ablations.

Referee Report

2 major / 1 minor

Summary. The paper claims to discover that emotion embeddings emerge as linearly decodable directions in frozen hidden states of diffusion/flow-matching TTS models and are nearly orthogonal to speaker-identity directions; this discovery motivates the DUET plug-and-play framework that performs unified dual-space (hidden + mel) emotion control via a single per-step update, outperforming 10 supervised SOTA emotional TTS baselines on five architecturally diverse pretrained backbones across three datasets while achieving the highest human-rated emotion appropriateness, with an additional qualitative demonstration on an Ameca humanoid robot.

Significance. If the linear-decodability and near-orthogonality findings hold with the reported stability across timesteps and backbones, DUET would constitute a meaningful advance by enabling training-free, fine-grained emotion intervention on existing high-quality TTS models without retraining or architectural changes.

major comments (2)

[Abstract] Abstract: the central premise that 'emotion embedding emerges as a linearly decodable direction of frozen hidden states, nearly orthogonal to the direction embedding speaker identity' is asserted without any supporting numerical evidence (cosine similarity, linear-probe accuracy, variance explained, or stability across diffusion/flow timesteps), which is load-bearing for the plug-and-play claim and the asserted advantage of the single per-step update.
[Abstract] Abstract: the validation statement ('outperforms 10 supervised state-of-the-art emotional TTS baselines across paradigms and achieves the highest human-rated emotion appropriateness') supplies no methods, datasets, metrics, error bars, or statistical tests, preventing assessment of whether the outperformance claim is supported.

minor comments (1)

The abstract lists 'five architecturally diverse pretrained TTS backbones across three datasets' but does not name the backbones or datasets, which would aid reproducibility assessment.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for these constructive comments on the abstract. Both points identify opportunities to make the abstract more self-contained while preserving its brevity. We address each below and will revise the abstract accordingly.

read point-by-point responses

Referee: [Abstract] Abstract: the central premise that 'emotion embedding emerges as a linearly decodable direction of frozen hidden states, nearly orthogonal to the direction embedding speaker identity' is asserted without any supporting numerical evidence (cosine similarity, linear-probe accuracy, variance explained, or stability across diffusion/flow timesteps), which is load-bearing for the plug-and-play claim and the asserted advantage of the single per-step update.

Authors: We agree that the abstract would benefit from brief quantitative anchors for this discovery. The main text (Section 3.2 and Figure 2) reports linear-probe accuracies, cosine similarities between emotion and speaker directions, variance explained, and timestep stability across backbones. We will revise the abstract to include the key supporting numbers (e.g., probe accuracy and average cosine similarity) so the premise is not asserted without evidence. revision: yes
Referee: [Abstract] Abstract: the validation statement ('outperforms 10 supervised state-of-the-art emotional TTS baselines across paradigms and achieves the highest human-rated emotion appropriateness') supplies no methods, datasets, metrics, error bars, or statistical tests, preventing assessment of whether the outperformance claim is supported.

Authors: We acknowledge that the abstract summarizes results without specifying the exact evaluation protocol. The full manuscript details the five backbones, three datasets, 10 baselines, metrics (including human ratings with statistical tests), and error bars in Sections 4 and 5. To improve readability of the abstract, we will add concise qualifiers such as 'on five backbones across three datasets with human ratings (n=XX, p<0.05)' while remaining within length limits. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical discovery presented as independent premise

full rationale

The provided abstract and description frame the linear decodability and near-orthogonality of emotion vs. speaker directions as an empirical discovery from frozen hidden states of pretrained models. This observation is used to motivate the DUET plug-and-play method, with subsequent validation on five diverse backbones, three datasets, and comparisons to 10 baselines. No equations, self-citations, or derivations are exhibited that reduce the claimed results or the steering mechanism to the inputs by construction (e.g., no fitted parameter renamed as prediction, no self-definitional loop, no uniqueness theorem imported from the same authors). The central premise is externally falsifiable via probe accuracies or cosine similarities on held-out models, satisfying the criteria for independent support. Honest non-finding applies.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities can be identified from the abstract alone.

pith-pipeline@v0.9.1-grok · 5731 in / 1130 out tokens · 46002 ms · 2026-06-30T17:32:38.345711+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

45 extracted references · 8 canonical work pages · 4 internal anchors

[1]

Universal guidance for diffusion models

Arpit Bansal, Hong-Min Chu, Avi Schwarzschild, Soumyadip Sengupta, Micah Goldblum, Jonas Geiping, and Tom Goldstein. Universal guidance for diffusion models. InIEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2023

2023
[2]

SEGA: Instructing text-to-image models using semantic guidance

Manuel Brack, Felix Friedrich, Dominik Hintersdorf, Lukas Struppek, Patrick Schramowski, and Kristian Kersting. SEGA: Instructing text-to-image models using semantic guidance. InAdvances in Neural Information Processing Systems, 2023

2023
[3]

Chang, Sungbok Lee, and Shrikanth S

Carlos Busso, Murtaza Bulut, Chi-Chung Lee, Abe Kazemzadeh, Emily Mower, Samuel Kim, Jeannette N. Chang, Sungbok Lee, and Shrikanth S. Narayanan. IEMOCAP: Interactive emotional dyadic motion capture database. Language Resources and Evaluation, 2008

2008
[4]

Cooper, Michael K

Houwei Cao, David G. Cooper, Michael K. Keutmann, Ruben C. Gur, Ani Nenkova, and Ragini Verma. CREMA- D: Crowd-sourced emotional multimodal actors dataset.IEEE Transactions on Affective Computing, 2014

2014
[5]

Humanoid robots and humanoid AI: Review, perspectives and directions.ACM Computing Surveys, 2025

Longbing Cao. Humanoid robots and humanoid AI: Review, perspectives and directions.ACM Computing Surveys, 2025

2025
[6]

EmoKnob: Enhance voice cloning with fine-grained emotion control

Haozhe Chen, Run Chen, and Julia Hirschberg. EmoKnob: Enhance voice cloning with fine-grained emotion control. InConference on Empirical Methods in Natural Language Processing, 2024

2024
[7]

WavLM: Large-scale self-supervised pre-training for full stack speech processing

Sanyuan Chen, Chengyi Wang, Zhengyang Chen, Yu Wu, Shujie Liu, Zhuo Chen, Jinyu Li, Naoyuki Kanda, Takuya Yoshioka, Xiong Xiao, Jian Wu, Long Zhou, Shuo Ren, Yanmin Qian, Yao Qian, Jian Wu, Michael Zeng, Xiangzhan Yu, and Furu Wei. WavLM: Large-scale self-supervised pre-training for full stack speech processing. IEEE Journal of Selected Topics in Signal P...

2022
[8]

F5-TTS: A fairytaler that fakes fluent and faithful speech with flow matching

Yushen Chen, Zhikang Niu, Ziyang Ma, Keqi Deng, Chunhui Wang, Jian Zhao, Kai Yu, and Xie Chen. F5-TTS: A fairytaler that fakes fluent and faithful speech with flow matching. InAnnual Meeting of the Association for Computational Linguistics, 2025

2025
[9]

EmoSphere-TTS: Emotional style and intensity modeling via spherical emotion vector for controllable emotional text-to-speech

Deok-Hyeon Cho, Hyung-Seok Oh, Seung-Bin Kim, Sang-Hoon Lee, and Seong-Whan Lee. EmoSphere-TTS: Emotional style and intensity modeling via spherical emotion vector for controllable emotional text-to-speech. In Annual Conference of the International Speech Communication Association, 2024

2024
[10]

DiEmo-TTS: Disentangled emotion representations via self-supervised distillation for cross-speaker emotion transfer in text-to-speech

Deok-Hyeon Cho, Hyung-Seok Oh, Seung-Bin Kim, and Seong-Whan Lee. DiEmo-TTS: Disentangled emotion representations via self-supervised distillation for cross-speaker emotion transfer in text-to-speech. InAnnual Conference of the International Speech Communication Association, 2025

2025
[11]

Diffusion models beat GANs on image synthesis

Prafulla Dhariwal and Alexander Nichol. Diffusion models beat GANs on image synthesis. InAdvances in Neural Information Processing Systems, 2021. 9 DUET: Unified Dual-Space Emotion Control for Diffusion and Flow-Matching Driven Text-to-Speech

2021
[12]

CosyVoice 2: Scalable Streaming Speech Synthesis with Large Language Models

Zhihao Du, Yuxuan Wang, Qian Chen, Xian Shi, Xiang Lv, Tianyu Zhao, Zhifu Gao, Yexin Yang, Changfeng Gao, Hui Wang, Fan Yu, Huadai Liu, Zhengyan Sheng, Yue Gu, Chong Deng, Wen Wang, Shiliang Zhang, Zhijie Yan, and Jingren Zhou. CosyV oice 2: Scalable streaming speech synthesis with large language models.arXiv preprint arXiv:2412.10117, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[13]

EmoDiff: Intensity controllable emotional text-to-speech with soft-label guidance

Yiwei Guo, Chenpeng Du, Xie Chen, and Kai Yu. EmoDiff: Intensity controllable emotional text-to-speech with soft-label guidance. InIEEE International Conference on Acoustics, Speech and Signal Processing, 2023

2023
[14]

Language models represent space and time

Wes Gurnee and Max Tegmark. Language models represent space and time. InInternational Conference on Learning Representations, 2024

2024
[15]

Denoising diffusion probabilistic models

Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. InAdvances in Neural Information Processing Systems, 2020

2020
[16]

HuBERT: Self-supervised speech representation learning by masked prediction of hidden units

Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, and Abdelrahman Mohamed. HuBERT: Self-supervised speech representation learning by masked prediction of hidden units. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2021

2021
[17]

Qwen3-TTS Technical Report

Hangrui Hu, Xinfa Zhu, Ting He, Dake Guo, Bin Zhang, Xiong Wang, Zhifang Guo, Ziyue Jiang, Hongkun Hao, Zishan Guo, Xinyu Zhang, Pei Zhang, Baosong Yang, Jin Xu, Jingren Zhou, and Junyang Lin. Qwen3-TTS technical report.arXiv preprint arXiv:2601.15621, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[18]

ProDiff: Progressive fast diffusion model for high-quality text-to-speech

Rongjie Huang, Zhou Zhao, Huadai Liu, Jinglin Liu, Chenye Cui, and Yi Ren. ProDiff: Progressive fast diffusion model for high-quality text-to-speech. InACM International Conference on Multimedia, 2022

2022
[19]

Guided-TTS: A diffusion model for text-to-speech via classifier guidance

Heeseung Kim, Sungwon Kim, and Sungroh Yoon. Guided-TTS: A diffusion model for text-to-speech via classifier guidance. InInternational Conference on Machine Learning, 2022

2022
[20]

Inference-time intervention: Eliciting truthful answers from a language model

Kenneth Li, Oam Patel, Fernanda Viégas, Hanspeter Pfister, and Martin Wattenberg. Inference-time intervention: Eliciting truthful answers from a language model. InAdvances in Neural Information Processing Systems, 2023

2023
[21]

UGotMe: An embodied system for affective human-robot interaction

Peizhen Li, Longbing Cao, Xiao-Ming Wu, Xiaohan Yu, and Runze Yang. UGotMe: An embodied system for affective human-robot interaction. InIEEE International Conference on Robotics and Automation, 2025

2025
[22]

Raghavan, Gavin Mischler, and Nima Mesgarani

Yinghao Aaron Li, Cong Han, Vinay S. Raghavan, Gavin Mischler, and Nima Mesgarani. StyleTTS 2: Towards human-level text-to-speech through style diffusion and adversarial training with large speech language models. In Advances in Neural Information Processing Systems, 2023

2023
[23]

Emotion2vec: Self-supervised pre-training for speech emotion representation

Ziyang Ma, Zhisheng Zheng, Jiaxin Ye, Jinchao Li, Zhifu Gao, ShiLiang Zhang, and Xie Chen. Emotion2vec: Self-supervised pre-training for speech emotion representation. InFindings of the Association for Computational Linguistics, 2024

2024
[24]

Matcha-TTS: A fast TTS architecture with conditional flow matching

Shivam Mehta, Ruibo Tu, Jonas Beskow, Éva Székely, and Gustav Eje Henter. Matcha-TTS: A fast TTS architecture with conditional flow matching. InIEEE International Conference on Acoustics, Speech and Signal Processing, 2024

2024
[25]

Zachary Novack, Julian McAuley, Taylor Berg-Kirkpatrick, and Nicholas J. Bryan. DITTO: Diffusion inference- time T-optimization for music generation. InInternational Conference on Machine Learning, 2024

2024
[26]

The linear representation hypothesis and the geometry of large language models

Kiho Park, Yo Joong Choe, and Victor Veitch. The linear representation hypothesis and the geometry of large language models. InInternational Conference on Machine Learning, 2024

2024
[27]

Grad-TTS: A diffusion probabilistic model for text-to-speech

Vadim Popov, Ivan V ovk, Vladimir Gogoryan, Tasnima Sadekova, and Mikhail Kudinov. Grad-TTS: A diffusion probabilistic model for text-to-speech. InInternational Conference on Machine Learning, 2021

2021
[28]

V ocos: Closing the gap between time-domain and fourier-based neural vocoders for high-quality audio synthesis

Hubert Siuzdak. V ocos: Closing the gap between time-domain and fourier-based neural vocoders for high-quality audio synthesis. InInternational Conference on Learning Representations, 2024

2024
[29]

R. J. Skerry-Ryan, Eric Battenberg, Ying Xiao, Yuxuan Wang, Daisy Stanton, Joel Shor, Ron J. Weiss, Rob Clark, and Rif A. Saurous. Towards end-to-end prosody transfer for expressive speech synthesis with Tacotron. In International Conference on Machine Learning, 2018

2018
[30]

EmoMix: Emotion mixing via diffusion models for emotional speech synthesis

Haobin Tang, Xulong Zhang, Jianzong Wang, Ning Cheng, and Jing Xiao. EmoMix: Emotion mixing via diffusion models for emotional speech synthesis. InAnnual Conference of the International Speech Communication Association, 2023

2023
[31]

Li, Arnab Sen Sharma, Aaron Mueller, Byron C

Eric Todd, Millicent L. Li, Arnab Sen Sharma, Aaron Mueller, Byron C. Wallace, and David Bau. Function vectors in large language models. InInternational Conference on Learning Representations, 2024

2024
[32]

Steering Language Models With Activation Engineering

Alexander Matt Turner, Lisa Thiergart, Gavin Leech, David Udell, Ulisse Mini, and Monte MacDiarmid. Activation addition: Steering language models without optimization.arXiv preprint arXiv:2308.10248, 2023. 10 DUET: Unified Dual-Space Emotion Control for Diffusion and Flow-Matching Driven Text-to-Speech

work page internal anchor Pith review Pith/arXiv arXiv 2023
[33]

Yuxuan Wang, Daisy Stanton, Yu Zhang, R. J. Skerry-Ryan, Eric Battenberg, Joel Shor, Ying Xiao, Fei Ren, Ye Jia, and Rif A. Saurous. Style tokens: Unsupervised style modeling, control and transfer in end-to-end speech synthesis. InInternational Conference on Machine Learning, 2018

2018
[34]

ProgDiffusion: Progressively self-encoding diffusion models

Zhangkai Wu, Xuhui Fan, and Longbing Cao. ProgDiffusion: Progressively self-encoding diffusion models. In Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 1633–1644, 2025

2025
[35]

SepDiff: Self-encoding parameter diffusion for learning latent semantics

Zhangkai Wu, Xuhui Fan, Jin Li, Zhilin Zhao, Hui Chen, and Longbing Cao. SepDiff: Self-encoding parameter diffusion for learning latent semantics. InProceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 3273–3284, 2025

2025
[36]

SCoT: Unifying consistency models and rectified flows via straight-consistent trajectories

Zhangkai Wu, Xuhui Fan, Hongyu Wu, and Longbing Cao. SCoT: Unifying consistency models and rectified flows via straight-consistent trajectories. InAdvances in Neural Information Processing Systems, 2025

2025
[37]

Manning, and Christopher Potts

Zhengxuan Wu, Aryaman Arora, Zheng Wang, Atticus Geiger, Dan Jurafsky, Christopher D. Manning, and Christopher Potts. ReFT: Representation finetuning for language models. InAdvances in Neural Information Processing Systems, 2024

2024
[38]

Emosteer- tts: Fine-grained and training-free emotion-controllable text-to-speech via activation steering.arXiv preprint arXiv:2508.03543,

Tianxin Xie, Shan Yang, Chenxing Li, Dong Yu, and Li Liu. EmoSteer-TTS: Fine-grained and training-free emotion-controllable text-to-speech via activation steering.arXiv preprint arXiv:2508.03543, 2025

work page arXiv 2025
[39]

EmoV oice: LLM-based emotional text-to-speech model with freestyle text prompting

Guanrou Yang, Chen Yang, Qian Chen, Ziyang Ma, Wenxi Chen, Wen Wang, Tianrui Wang, Yifan Yang, Zhikang Niu, Wenrui Liu, Fan Yu, Zhihao Du, Zhifu Gao, ShiLiang Zhang, and Xie Chen. EmoV oice: LLM-based emotional text-to-speech model with freestyle text prompting. InACM International Conference on Multimedia, 2025

2025
[40]

TFG: Unified training-free guidance for diffusion models

Haotian Ye, Haowei Lin, Jiaqi Han, Minkai Xu, Sheng Liu, Yitao Liang, Jianzhu Ma, James Zou, and Stefano Ermon. TFG: Unified training-free guidance for diffusion models. InAdvances in Neural Information Processing Systems, 2024

2024
[41]

Learning physiology-informed vocal spectrotemporal representations for speech emotion recognition.arXiv preprint arXiv:2602.13259, 2026

Xu Zhang, Longbing Cao, Runze Yang, and Zhangkai Wu. Learning physiology-informed vocal spectrotemporal representations for speech emotion recognition.arXiv preprint arXiv:2602.13259, 2026

work page arXiv 2026
[42]

Emotional voice conversion: Theory, databases and ESD

Kun Zhou, Berrak Sisman, Rui Liu, and Haizhou Li. Emotional voice conversion: Theory, databases and ESD. Speech Communication, 2022

2022
[43]

EmoShift: Lightweight activation steering for enhanced emotion-aware speech synthesis.arXiv preprint arXiv:2601.22873, 2026

Li Zhou, Hao Jiang, Junjie Li, Tianrui Wang, and Haizhou Li. EmoShift: Lightweight activation steering for enhanced emotion-aware speech synthesis.arXiv preprint arXiv:2601.22873, 2026

work page arXiv 2026
[44]

Shangguan et al

Siyi Zhou, Yiquan Zhou, Yi He, Xun Zhou, Jinchao Wang, Wei Deng, and Jingchen Shu. IndexTTS2: A breakthrough in emotionally expressive and duration-controlled auto-regressive zero-shot text-to-speech.arXiv preprint arXiv:2506.21619, 2025

work page arXiv 2025
[45]

Representation Engineering: A Top-Down Approach to AI Transparency

Andy Zou, Long Phan, Sarah Chen, James Campbell, Phillip Guo, Richard Ren, Alexander Pan, Xuwang Yin, Mantas Mazeika, Ann-Kathrin Dombrowski, Shashwat Goel, Nathaniel Li, Michael J. Byun, Zifan Wang, Alex Mallen, Steven Basart, Sanmi Koyejo, Dawn Song, Matt Fredrikson, J. Zico Kolter, and Dan Hendrycks. Representation engineering: A top-down approach to A...

work page internal anchor Pith review Pith/arXiv arXiv 2025

[1] [1]

Universal guidance for diffusion models

Arpit Bansal, Hong-Min Chu, Avi Schwarzschild, Soumyadip Sengupta, Micah Goldblum, Jonas Geiping, and Tom Goldstein. Universal guidance for diffusion models. InIEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2023

2023

[2] [2]

SEGA: Instructing text-to-image models using semantic guidance

Manuel Brack, Felix Friedrich, Dominik Hintersdorf, Lukas Struppek, Patrick Schramowski, and Kristian Kersting. SEGA: Instructing text-to-image models using semantic guidance. InAdvances in Neural Information Processing Systems, 2023

2023

[3] [3]

Chang, Sungbok Lee, and Shrikanth S

Carlos Busso, Murtaza Bulut, Chi-Chung Lee, Abe Kazemzadeh, Emily Mower, Samuel Kim, Jeannette N. Chang, Sungbok Lee, and Shrikanth S. Narayanan. IEMOCAP: Interactive emotional dyadic motion capture database. Language Resources and Evaluation, 2008

2008

[4] [4]

Cooper, Michael K

Houwei Cao, David G. Cooper, Michael K. Keutmann, Ruben C. Gur, Ani Nenkova, and Ragini Verma. CREMA- D: Crowd-sourced emotional multimodal actors dataset.IEEE Transactions on Affective Computing, 2014

2014

[5] [5]

Humanoid robots and humanoid AI: Review, perspectives and directions.ACM Computing Surveys, 2025

Longbing Cao. Humanoid robots and humanoid AI: Review, perspectives and directions.ACM Computing Surveys, 2025

2025

[6] [6]

EmoKnob: Enhance voice cloning with fine-grained emotion control

Haozhe Chen, Run Chen, and Julia Hirschberg. EmoKnob: Enhance voice cloning with fine-grained emotion control. InConference on Empirical Methods in Natural Language Processing, 2024

2024

[7] [7]

WavLM: Large-scale self-supervised pre-training for full stack speech processing

Sanyuan Chen, Chengyi Wang, Zhengyang Chen, Yu Wu, Shujie Liu, Zhuo Chen, Jinyu Li, Naoyuki Kanda, Takuya Yoshioka, Xiong Xiao, Jian Wu, Long Zhou, Shuo Ren, Yanmin Qian, Yao Qian, Jian Wu, Michael Zeng, Xiangzhan Yu, and Furu Wei. WavLM: Large-scale self-supervised pre-training for full stack speech processing. IEEE Journal of Selected Topics in Signal P...

2022

[8] [8]

F5-TTS: A fairytaler that fakes fluent and faithful speech with flow matching

Yushen Chen, Zhikang Niu, Ziyang Ma, Keqi Deng, Chunhui Wang, Jian Zhao, Kai Yu, and Xie Chen. F5-TTS: A fairytaler that fakes fluent and faithful speech with flow matching. InAnnual Meeting of the Association for Computational Linguistics, 2025

2025

[9] [9]

EmoSphere-TTS: Emotional style and intensity modeling via spherical emotion vector for controllable emotional text-to-speech

Deok-Hyeon Cho, Hyung-Seok Oh, Seung-Bin Kim, Sang-Hoon Lee, and Seong-Whan Lee. EmoSphere-TTS: Emotional style and intensity modeling via spherical emotion vector for controllable emotional text-to-speech. In Annual Conference of the International Speech Communication Association, 2024

2024

[10] [10]

DiEmo-TTS: Disentangled emotion representations via self-supervised distillation for cross-speaker emotion transfer in text-to-speech

Deok-Hyeon Cho, Hyung-Seok Oh, Seung-Bin Kim, and Seong-Whan Lee. DiEmo-TTS: Disentangled emotion representations via self-supervised distillation for cross-speaker emotion transfer in text-to-speech. InAnnual Conference of the International Speech Communication Association, 2025

2025

[11] [11]

Diffusion models beat GANs on image synthesis

Prafulla Dhariwal and Alexander Nichol. Diffusion models beat GANs on image synthesis. InAdvances in Neural Information Processing Systems, 2021. 9 DUET: Unified Dual-Space Emotion Control for Diffusion and Flow-Matching Driven Text-to-Speech

2021

[12] [12]

CosyVoice 2: Scalable Streaming Speech Synthesis with Large Language Models

Zhihao Du, Yuxuan Wang, Qian Chen, Xian Shi, Xiang Lv, Tianyu Zhao, Zhifu Gao, Yexin Yang, Changfeng Gao, Hui Wang, Fan Yu, Huadai Liu, Zhengyan Sheng, Yue Gu, Chong Deng, Wen Wang, Shiliang Zhang, Zhijie Yan, and Jingren Zhou. CosyV oice 2: Scalable streaming speech synthesis with large language models.arXiv preprint arXiv:2412.10117, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[13] [13]

EmoDiff: Intensity controllable emotional text-to-speech with soft-label guidance

Yiwei Guo, Chenpeng Du, Xie Chen, and Kai Yu. EmoDiff: Intensity controllable emotional text-to-speech with soft-label guidance. InIEEE International Conference on Acoustics, Speech and Signal Processing, 2023

2023

[14] [14]

Language models represent space and time

Wes Gurnee and Max Tegmark. Language models represent space and time. InInternational Conference on Learning Representations, 2024

2024

[15] [15]

Denoising diffusion probabilistic models

Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. InAdvances in Neural Information Processing Systems, 2020

2020

[16] [16]

HuBERT: Self-supervised speech representation learning by masked prediction of hidden units

Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, and Abdelrahman Mohamed. HuBERT: Self-supervised speech representation learning by masked prediction of hidden units. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2021

2021

[17] [17]

Qwen3-TTS Technical Report

Hangrui Hu, Xinfa Zhu, Ting He, Dake Guo, Bin Zhang, Xiong Wang, Zhifang Guo, Ziyue Jiang, Hongkun Hao, Zishan Guo, Xinyu Zhang, Pei Zhang, Baosong Yang, Jin Xu, Jingren Zhou, and Junyang Lin. Qwen3-TTS technical report.arXiv preprint arXiv:2601.15621, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[18] [18]

ProDiff: Progressive fast diffusion model for high-quality text-to-speech

Rongjie Huang, Zhou Zhao, Huadai Liu, Jinglin Liu, Chenye Cui, and Yi Ren. ProDiff: Progressive fast diffusion model for high-quality text-to-speech. InACM International Conference on Multimedia, 2022

2022

[19] [19]

Guided-TTS: A diffusion model for text-to-speech via classifier guidance

Heeseung Kim, Sungwon Kim, and Sungroh Yoon. Guided-TTS: A diffusion model for text-to-speech via classifier guidance. InInternational Conference on Machine Learning, 2022

2022

[20] [20]

Inference-time intervention: Eliciting truthful answers from a language model

Kenneth Li, Oam Patel, Fernanda Viégas, Hanspeter Pfister, and Martin Wattenberg. Inference-time intervention: Eliciting truthful answers from a language model. InAdvances in Neural Information Processing Systems, 2023

2023

[21] [21]

UGotMe: An embodied system for affective human-robot interaction

Peizhen Li, Longbing Cao, Xiao-Ming Wu, Xiaohan Yu, and Runze Yang. UGotMe: An embodied system for affective human-robot interaction. InIEEE International Conference on Robotics and Automation, 2025

2025

[22] [22]

Raghavan, Gavin Mischler, and Nima Mesgarani

Yinghao Aaron Li, Cong Han, Vinay S. Raghavan, Gavin Mischler, and Nima Mesgarani. StyleTTS 2: Towards human-level text-to-speech through style diffusion and adversarial training with large speech language models. In Advances in Neural Information Processing Systems, 2023

2023

[23] [23]

Emotion2vec: Self-supervised pre-training for speech emotion representation

Ziyang Ma, Zhisheng Zheng, Jiaxin Ye, Jinchao Li, Zhifu Gao, ShiLiang Zhang, and Xie Chen. Emotion2vec: Self-supervised pre-training for speech emotion representation. InFindings of the Association for Computational Linguistics, 2024

2024

[24] [24]

Matcha-TTS: A fast TTS architecture with conditional flow matching

Shivam Mehta, Ruibo Tu, Jonas Beskow, Éva Székely, and Gustav Eje Henter. Matcha-TTS: A fast TTS architecture with conditional flow matching. InIEEE International Conference on Acoustics, Speech and Signal Processing, 2024

2024

[25] [25]

Zachary Novack, Julian McAuley, Taylor Berg-Kirkpatrick, and Nicholas J. Bryan. DITTO: Diffusion inference- time T-optimization for music generation. InInternational Conference on Machine Learning, 2024

2024

[26] [26]

The linear representation hypothesis and the geometry of large language models

Kiho Park, Yo Joong Choe, and Victor Veitch. The linear representation hypothesis and the geometry of large language models. InInternational Conference on Machine Learning, 2024

2024

[27] [27]

Grad-TTS: A diffusion probabilistic model for text-to-speech

Vadim Popov, Ivan V ovk, Vladimir Gogoryan, Tasnima Sadekova, and Mikhail Kudinov. Grad-TTS: A diffusion probabilistic model for text-to-speech. InInternational Conference on Machine Learning, 2021

2021

[28] [28]

V ocos: Closing the gap between time-domain and fourier-based neural vocoders for high-quality audio synthesis

Hubert Siuzdak. V ocos: Closing the gap between time-domain and fourier-based neural vocoders for high-quality audio synthesis. InInternational Conference on Learning Representations, 2024

2024

[29] [29]

R. J. Skerry-Ryan, Eric Battenberg, Ying Xiao, Yuxuan Wang, Daisy Stanton, Joel Shor, Ron J. Weiss, Rob Clark, and Rif A. Saurous. Towards end-to-end prosody transfer for expressive speech synthesis with Tacotron. In International Conference on Machine Learning, 2018

2018

[30] [30]

EmoMix: Emotion mixing via diffusion models for emotional speech synthesis

Haobin Tang, Xulong Zhang, Jianzong Wang, Ning Cheng, and Jing Xiao. EmoMix: Emotion mixing via diffusion models for emotional speech synthesis. InAnnual Conference of the International Speech Communication Association, 2023

2023

[31] [31]

Li, Arnab Sen Sharma, Aaron Mueller, Byron C

Eric Todd, Millicent L. Li, Arnab Sen Sharma, Aaron Mueller, Byron C. Wallace, and David Bau. Function vectors in large language models. InInternational Conference on Learning Representations, 2024

2024

[32] [32]

Steering Language Models With Activation Engineering

Alexander Matt Turner, Lisa Thiergart, Gavin Leech, David Udell, Ulisse Mini, and Monte MacDiarmid. Activation addition: Steering language models without optimization.arXiv preprint arXiv:2308.10248, 2023. 10 DUET: Unified Dual-Space Emotion Control for Diffusion and Flow-Matching Driven Text-to-Speech

work page internal anchor Pith review Pith/arXiv arXiv 2023

[33] [33]

Yuxuan Wang, Daisy Stanton, Yu Zhang, R. J. Skerry-Ryan, Eric Battenberg, Joel Shor, Ying Xiao, Fei Ren, Ye Jia, and Rif A. Saurous. Style tokens: Unsupervised style modeling, control and transfer in end-to-end speech synthesis. InInternational Conference on Machine Learning, 2018

2018

[34] [34]

ProgDiffusion: Progressively self-encoding diffusion models

Zhangkai Wu, Xuhui Fan, and Longbing Cao. ProgDiffusion: Progressively self-encoding diffusion models. In Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 1633–1644, 2025

2025

[35] [35]

SepDiff: Self-encoding parameter diffusion for learning latent semantics

Zhangkai Wu, Xuhui Fan, Jin Li, Zhilin Zhao, Hui Chen, and Longbing Cao. SepDiff: Self-encoding parameter diffusion for learning latent semantics. InProceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 3273–3284, 2025

2025

[36] [36]

SCoT: Unifying consistency models and rectified flows via straight-consistent trajectories

Zhangkai Wu, Xuhui Fan, Hongyu Wu, and Longbing Cao. SCoT: Unifying consistency models and rectified flows via straight-consistent trajectories. InAdvances in Neural Information Processing Systems, 2025

2025

[37] [37]

Manning, and Christopher Potts

Zhengxuan Wu, Aryaman Arora, Zheng Wang, Atticus Geiger, Dan Jurafsky, Christopher D. Manning, and Christopher Potts. ReFT: Representation finetuning for language models. InAdvances in Neural Information Processing Systems, 2024

2024

[38] [38]

Emosteer- tts: Fine-grained and training-free emotion-controllable text-to-speech via activation steering.arXiv preprint arXiv:2508.03543,

Tianxin Xie, Shan Yang, Chenxing Li, Dong Yu, and Li Liu. EmoSteer-TTS: Fine-grained and training-free emotion-controllable text-to-speech via activation steering.arXiv preprint arXiv:2508.03543, 2025

work page arXiv 2025

[39] [39]

EmoV oice: LLM-based emotional text-to-speech model with freestyle text prompting

Guanrou Yang, Chen Yang, Qian Chen, Ziyang Ma, Wenxi Chen, Wen Wang, Tianrui Wang, Yifan Yang, Zhikang Niu, Wenrui Liu, Fan Yu, Zhihao Du, Zhifu Gao, ShiLiang Zhang, and Xie Chen. EmoV oice: LLM-based emotional text-to-speech model with freestyle text prompting. InACM International Conference on Multimedia, 2025

2025

[40] [40]

TFG: Unified training-free guidance for diffusion models

Haotian Ye, Haowei Lin, Jiaqi Han, Minkai Xu, Sheng Liu, Yitao Liang, Jianzhu Ma, James Zou, and Stefano Ermon. TFG: Unified training-free guidance for diffusion models. InAdvances in Neural Information Processing Systems, 2024

2024

[41] [41]

Learning physiology-informed vocal spectrotemporal representations for speech emotion recognition.arXiv preprint arXiv:2602.13259, 2026

Xu Zhang, Longbing Cao, Runze Yang, and Zhangkai Wu. Learning physiology-informed vocal spectrotemporal representations for speech emotion recognition.arXiv preprint arXiv:2602.13259, 2026

work page arXiv 2026

[42] [42]

Emotional voice conversion: Theory, databases and ESD

Kun Zhou, Berrak Sisman, Rui Liu, and Haizhou Li. Emotional voice conversion: Theory, databases and ESD. Speech Communication, 2022

2022

[43] [43]

EmoShift: Lightweight activation steering for enhanced emotion-aware speech synthesis.arXiv preprint arXiv:2601.22873, 2026

Li Zhou, Hao Jiang, Junjie Li, Tianrui Wang, and Haizhou Li. EmoShift: Lightweight activation steering for enhanced emotion-aware speech synthesis.arXiv preprint arXiv:2601.22873, 2026

work page arXiv 2026

[44] [44]

Shangguan et al

Siyi Zhou, Yiquan Zhou, Yi He, Xun Zhou, Jinchao Wang, Wei Deng, and Jingchen Shu. IndexTTS2: A breakthrough in emotionally expressive and duration-controlled auto-regressive zero-shot text-to-speech.arXiv preprint arXiv:2506.21619, 2025

work page arXiv 2025

[45] [45]

Representation Engineering: A Top-Down Approach to AI Transparency

Andy Zou, Long Phan, Sarah Chen, James Campbell, Phillip Guo, Richard Ren, Alexander Pan, Xuwang Yin, Mantas Mazeika, Ann-Kathrin Dombrowski, Shashwat Goel, Nathaniel Li, Michael J. Byun, Zifan Wang, Alex Mallen, Steven Basart, Sanmi Koyejo, Dawn Song, Matt Fredrikson, J. Zico Kolter, and Dan Hendrycks. Representation engineering: A top-down approach to A...

work page internal anchor Pith review Pith/arXiv arXiv 2025