
arxiv: 2606.00066 · v1 · pith:NFRTCUAEnew · submitted 2026-05-20 · 💻 cs.SD · eess.AS

DUET: Unified Dual-Space Emotion Control for Diffusion and Flow-Matching Driven Text-to-Speech

Pith reviewed 2026-06-30 17:32 UTC · model grok-4.3

classification 💻 cs.SD eess.AS
keywords text-to-speech · emotion control · diffusion models · flow matching · plug-and-play · dual-space steering · pretrained models · speech synthesis

The pith

A plug-and-play framework steers pretrained diffusion and flow-matching TTS models along linearly decodable emotion directions in hidden and mel spaces.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper reports that emotion information appears as a linear direction within the frozen hidden states of TTS models and stays nearly orthogonal to speaker identity. This allows a unified dual-space method called DUET to insert fine-grained emotion control into existing diffusion and flow-matching systems during generation without any retraining. At each generation step, the approach combines a hidden-space shift toward the target emotion with a refinement of spectral details in mel space, driven by gradients backpropagated from a differentiable vocoder. Experiments across five architecturally different backbones and three datasets show it surpasses ten supervised emotional TTS baselines while recording the highest human ratings for emotion appropriateness. The same system also produces expressive speech when deployed on a humanoid robot.

Core claim

We discover that emotion embedding emerges as a linearly decodable direction of frozen hidden states, nearly orthogonal to the direction embedding speaker identity. This inspires DUET, a plug-and-play framework for emotion control over pretrained diffusion and flow-matching based TTS models. During generation, DUET unifies dual-space control to achieve fine-grained emotion intervention in a single per-step update: hidden space steering shifts generation along the target emotion direction, while mel-space guidance refines spectral details through gradients backpropagated from a differentiable vocoder. We validate DUET on five architecturally diverse pretrained TTS backbones across three datas

What carries the argument

Dual-space control that performs hidden-space steering along the linearly decodable emotion direction while applying mel-space guidance via gradients from a differentiable vocoder.
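
To make the shape of such a per-step update concrete, here is a minimal NumPy sketch. It is not the authors' implementation: the backbone, the "vocoder", and the emotion scorer are hypothetical random linear maps (`W_in`, `W_out`, `W_voc`, `w_score`), the gradient is written analytically instead of being backpropagated, and the step sizes `alpha` and `eta` are illustrative names only.

```python
import numpy as np

rng = np.random.default_rng(0)
D_MEL, D_HID = 8, 16

# Hypothetical stand-ins for a frozen backbone: a map from the mel frame to
# hidden states, and a head from hidden states to a velocity in mel space.
W_in = rng.normal(size=(D_HID, D_MEL)) / np.sqrt(D_MEL)
W_out = rng.normal(size=(D_MEL, D_HID)) / np.sqrt(D_HID)

# Hypothetical linear "vocoder" and a linear emotion scorer on its output.
W_voc = rng.normal(size=(D_MEL, D_MEL)) / np.sqrt(D_MEL)
w_score = rng.normal(size=D_MEL)


def unit(v):
    """Scale a vector to unit length."""
    return v / np.linalg.norm(v)


def emotion_score(x):
    """Toy score of the vocoded frame (higher means closer to the target emotion)."""
    return float(w_score @ (W_voc @ x))


def score_grad(x):
    """Analytic gradient of emotion_score w.r.t. x (stands in for backprop)."""
    return W_voc.T @ w_score


def dual_space_step(x, dt, emo_dir, alpha, eta):
    """One Euler step with hidden-space steering and mel-space guidance."""
    h = np.tanh(W_in @ x)          # frozen hidden state
    h = h + alpha * emo_dir        # hidden-space steering along the emotion direction
    v = W_out @ h                  # velocity predicted from the steered state
    return x + dt * v + eta * score_grad(x)  # mel-space guidance (gradient ascent)
```

Setting `alpha = 0` and `eta = 0` recovers the unmodified sampler step, which is what makes this kind of control plug-and-play in the sketch: both terms are additive corrections applied at inference time.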

If this is right

  • Pretrained diffusion and flow-matching TTS models can receive emotion control without retraining or architectural changes.
  • Fine-grained emotion intervention occurs inside a single per-step update during inference.
  • The method exceeds the performance of supervised emotional TTS systems trained specifically for the task.
  • The same plug-and-play control extends to embodied platforms such as humanoid robots.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If comparable linear directions exist for other attributes such as prosody or accent, the same steering approach could add those controls without new training.
  • Widespread adoption would lower the barrier to building expressive speech systems by removing the need for emotion-specific labeled datasets.
  • Applying the orthogonality test to additional generative audio models outside TTS would clarify how general the linear-decodability property is.

Load-bearing premise

Emotion information forms a linearly decodable direction inside the frozen hidden states of pretrained TTS models and stays nearly orthogonal to speaker identity.
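
One way to check a premise of this kind is to estimate an emotion direction and a speaker direction from hidden states and measure the cosine between them. The sketch below does this on synthetic data, assuming a difference-of-class-means estimator; the paper's actual extraction procedure is not given in the abstract, and the data, dimensions, and the names `u_emo` and `u_spk` are made up for illustration.

```python
import numpy as np


def mean_difference_direction(H, labels):
    """Unit vector from the class-0 mean to the class-1 mean of hidden states H."""
    d = H[labels == 1].mean(axis=0) - H[labels == 0].mean(axis=0)
    return d / np.linalg.norm(d)


def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


# Synthetic hidden states (an assumption, not real model activations):
# emotion and speaker each shift the state along their own axis.
rng = np.random.default_rng(0)
n, dim = 2000, 32
emotion = rng.integers(0, 2, size=n)
speaker = rng.integers(0, 2, size=n)
u_emo, u_spk = np.eye(dim)[0], np.eye(dim)[1]
H = (2.0 * np.outer(emotion, u_emo)
     + 2.0 * np.outer(speaker, u_spk)
     + rng.normal(scale=0.5, size=(n, dim)))

d_emo = mean_difference_direction(H, emotion)
d_spk = mean_difference_direction(H, speaker)
print("cosine(emotion, speaker) =", cosine(d_emo, d_spk))
```

On real activations the same two functions would be applied to hidden states collected from the frozen backbone; a cosine far from zero would indicate that the two attributes are entangled.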

What would settle it

The claim would be undermined if steering along the extracted emotion direction produced no measurable rise in human-rated emotion appropriateness on the tested backbones, or if the extracted direction proved linearly entangled with speaker identity in the hidden states.

Figures

Figures reproduced from arXiv: 2606.00066 by Longbing Cao, Xu Zhang, Zhangkai Wu.

Figure 1: Emotion is subtle but speaker-consistent and linearly decodable in pretrained TTS hidden states.
Figure 2: (a) Colored segments show the accuracy achieved by DUET, the top line marks the GT ceiling of the dataset, and gray segments indicate the gap between them. The noticeably larger gap on angry pinpoints the challenge of capturing its sharp temporal dynamics. (b) Left: scatter panel of hidden states colored by emotion, with speakers distributed uniformly within each cluster, indicating emotion and speaker sub…
Figure 3: Plug-and-play deployment of DUET on the Ameca humanoid robot. For each of three emotions,
Original abstract

Diffusion and flow-matching based text-to-speech (TTS) models excel in naturalness but often lack explicit emotion control, as emotional signals remain entangled with speaker identity. We discover that emotion embedding emerges as a linearly decodable direction of frozen hidden states, nearly orthogonal to the direction embedding speaker identity. This inspires a plug-and-play framework DUET for emotion control over pretrained diffusion and flow-matching based TTS models. During generation, DUET unifies dual-space control to achieve fine-grained emotion intervention in a single per-step update: hidden space steering shifts generation along the target emotion direction, while mel-space guidance refines spectral details through gradients backpropagated from a differentiable vocoder. We validate DUET on five architecturally diverse pretrained TTS backbones across three datasets, where it outperforms 10 supervised state-of-the-art emotional TTS baselines across paradigms and achieves the highest human-rated emotion appropriateness. To further showcase its qualitative behavior, we deploy DUET on an Ameca humanoid robot, where it produces richly expressive emotional speech on the humanoid, demonstrating the strong potential for plug-and-play affective interaction for embodied agents.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper claims to discover that emotion embeddings emerge as linearly decodable directions in frozen hidden states of diffusion/flow-matching TTS models and are nearly orthogonal to speaker-identity directions; this discovery motivates the DUET plug-and-play framework that performs unified dual-space (hidden + mel) emotion control via a single per-step update, outperforming 10 supervised SOTA emotional TTS baselines on five architecturally diverse pretrained backbones across three datasets while achieving the highest human-rated emotion appropriateness, with an additional qualitative demonstration on an Ameca humanoid robot.

Significance. If the linear-decodability and near-orthogonality findings hold with the reported stability across timesteps and backbones, DUET would constitute a meaningful advance by enabling training-free, fine-grained emotion intervention on existing high-quality TTS models without retraining or architectural changes.

major comments (2)
  1. [Abstract] Abstract: the central premise that 'emotion embedding emerges as a linearly decodable direction of frozen hidden states, nearly orthogonal to the direction embedding speaker identity' is asserted without any supporting numerical evidence (cosine similarity, linear-probe accuracy, variance explained, or stability across diffusion/flow timesteps), which is load-bearing for the plug-and-play claim and the asserted advantage of the single per-step update.
  2. [Abstract] Abstract: the validation statement ('outperforms 10 supervised state-of-the-art emotional TTS baselines across paradigms and achieves the highest human-rated emotion appropriateness') supplies no methods, datasets, metrics, error bars, or statistical tests, preventing assessment of whether the outperformance claim is supported.
minor comments (1)
  1. The abstract lists 'five architecturally diverse pretrained TTS backbones across three datasets' but does not name the backbones or datasets, which would aid reproducibility assessment.
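
Two of the quantities the referee asks for, linear-probe accuracy and variance explained, can be computed as in the sketch below. It assumes a least-squares probe on binary emotion labels and synthetic hidden states; the paper's probe, labels, and reported numbers are not available from the abstract, so nothing here reproduces its results.

```python
import numpy as np


def add_bias(H):
    """Append a constant column so the probe can learn an offset."""
    return np.hstack([H, np.ones((len(H), 1))])


def fit_linear_probe(H, y):
    """Least-squares linear probe for labels in {0, 1}; returns weights plus bias."""
    w, *_ = np.linalg.lstsq(add_bias(H), 2.0 * y - 1.0, rcond=None)
    return w


def probe_accuracy(w, H, y):
    """Fraction of samples whose sign under the probe matches the label."""
    pred = (add_bias(H) @ w > 0).astype(int)
    return float((pred == y).mean())


def variance_explained(H, direction):
    """Fraction of the total variance of H that lies along one direction."""
    Hc = H - H.mean(axis=0)
    u = direction / np.linalg.norm(direction)
    return float(((Hc @ u) ** 2).sum() / (Hc ** 2).sum())


# Synthetic hidden states (an assumption, not real model activations).
rng = np.random.default_rng(0)
n, dim = 2000, 32
emotion = rng.integers(0, 2, size=n)
speaker = rng.integers(0, 2, size=n)
H = rng.normal(scale=0.5, size=(n, dim))
H[:, 0] += 2.0 * emotion   # emotion shifts axis 0
H[:, 1] += 2.0 * speaker   # speaker shifts axis 1

w = fit_linear_probe(H[:1500], emotion[:1500])       # fit on a training split
acc = probe_accuracy(w, H[1500:], emotion[1500:])    # evaluate on held-out data
frac = variance_explained(H, w[:-1])                 # drop the bias weight
print("held-out probe accuracy:", acc)
print("variance along probe direction:", frac)
```

Reporting accuracy on a held-out split, rather than on the data used to fit the probe, is what makes the decodability number meaningful; repeating the computation at several diffusion or flow timesteps would address the stability question.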

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for these constructive comments on the abstract. Both points identify opportunities to make the abstract more self-contained while preserving its brevity. We address each below and will revise the abstract accordingly.

Point-by-point responses
  1. Referee: [Abstract] Abstract: the central premise that 'emotion embedding emerges as a linearly decodable direction of frozen hidden states, nearly orthogonal to the direction embedding speaker identity' is asserted without any supporting numerical evidence (cosine similarity, linear-probe accuracy, variance explained, or stability across diffusion/flow timesteps), which is load-bearing for the plug-and-play claim and the asserted advantage of the single per-step update.

    Authors: We agree that the abstract would benefit from brief quantitative anchors for this discovery. The main text (Section 3.2 and Figure 2) reports linear-probe accuracies, cosine similarities between emotion and speaker directions, variance explained, and timestep stability across backbones. We will revise the abstract to include the key supporting numbers (e.g., probe accuracy and average cosine similarity) so the premise is not asserted without evidence. revision: yes

  2. Referee: [Abstract] Abstract: the validation statement ('outperforms 10 supervised state-of-the-art emotional TTS baselines across paradigms and achieves the highest human-rated emotion appropriateness') supplies no methods, datasets, metrics, error bars, or statistical tests, preventing assessment of whether the outperformance claim is supported.

    Authors: We acknowledge that the abstract summarizes results without specifying the exact evaluation protocol. The full manuscript details the five backbones, three datasets, 10 baselines, metrics (including human ratings with statistical tests), and error bars in Sections 4 and 5. To improve readability of the abstract, we will add concise qualifiers such as 'on five backbones across three datasets with human ratings (n=XX, p<0.05)' while remaining within length limits. revision: yes
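
For the kind of statistical test on human ratings mentioned in this response, one common option is a paired sign-flip permutation test on per-item rating differences. The sketch below is a generic illustration, not the test the authors used (the abstract does not say which one they ran); the function name and its parameters are hypothetical.

```python
import numpy as np


def paired_permutation_pvalue(a, b, n_perm=10000, seed=0):
    """Two-sided sign-flip permutation test on paired rating differences.

    a, b: ratings of two systems on the same items, in the same order.
    """
    rng = np.random.default_rng(seed)
    diff = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    observed = abs(diff.mean())
    # Under the null hypothesis the sign of each paired difference is arbitrary.
    signs = rng.choice([-1.0, 1.0], size=(n_perm, diff.size))
    null = np.abs((signs * diff).mean(axis=1))
    # Add-one correction keeps the estimated p-value strictly positive.
    return float((1 + (null >= observed).sum()) / (1 + n_perm))
```

A call such as `paired_permutation_pvalue(ratings_system_a, ratings_system_b)` returns the p-value for the mean paired difference; the test requires that both systems were rated on the same items.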

Circularity Check

0 steps flagged

No circularity: empirical discovery presented as independent premise

Full rationale

The provided abstract and description frame the linear decodability and near-orthogonality of emotion vs. speaker directions as an empirical discovery from frozen hidden states of pretrained models. This observation is used to motivate the DUET plug-and-play method, with subsequent validation on five diverse backbones, three datasets, and comparisons to 10 baselines. No equations, self-citations, or derivations are exhibited that reduce the claimed results or the steering mechanism to the inputs by construction (e.g., no fitted parameter renamed as prediction, no self-definitional loop, no uniqueness theorem imported from the same authors). The central premise is externally falsifiable via probe accuracies or cosine similarities on held-out models, satisfying the criteria for independent support. Honest non-finding applies.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities can be identified from the abstract alone.

pith-pipeline@v0.9.1-grok · 5731 in / 1130 out tokens · 46002 ms · 2026-06-30T17:32:38.345711+00:00 · methodology

