
arxiv: 2605.17583 · v1 · pith:E625IQ2Wnew · submitted 2026-05-14 · 💻 cs.CV

AgentSteerTTS: A Multi-Agent Closed-Loop Framework for Composite-Instruction Text-to-Speech

Pith reviewed 2026-05-20 21:11 UTC · model grok-4.3

classification 💻 cs.CV
keywords text-to-speech · multi-agent framework · composite instructions · adversarial disentanglement · acoustic prototypes · closed-loop control · expressive speech synthesis · intent control

The pith

A multi-agent closed-loop framework separates speaker identity from emotion and anchors composite text intents to acoustic prototypes for more faithful TTS output.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces AgentSteerTTS to address the mismatch between discrete textual instructions and continuous speech features in text-to-speech systems. It deploys three cooperating agents in a feedback loop: one uses adversarial training to isolate speaker identity from emotion and prosody, another retrieves and blends examples from an acoustic library to match abstract intents, and a third corrects mismatches via gradient steps and perceptual review. A sympathetic reader would care because composite instructions, such as a specific emotion combined with particular speaker traits, often produce leakage or drift in current models. If the approach holds, generated speech would reflect detailed instructions more reliably, without manual parameter tuning.

Core claim

AgentSteerTTS is a multi-agent closed-loop framework for intent-faithful expressive control of composite instructions in TTS. An adversarial disentanglement agent mitigates speaker-emotion leakage by learning separable identity and emotion-prosody subspaces with leakage-suppressing regularization. A Dual-Stream Anchoring Controller grounds abstract intents using a large-scale acoustic prototype library, where a Retrieval Agent selects expressive anchors and a Synthesis Agent fuses them into continuous control vectors via gated attention. A Fast-Slow Feedback Agent refines output intensity through latent gradient correction and resolves semantic-acoustic mismatches using high-level perceptual critique.

What carries the argument

The multi-agent closed-loop framework that combines an adversarial disentanglement agent for subspace separation, a dual-stream anchoring controller with retrieval and synthesis agents over an acoustic prototype library, and a fast-slow feedback agent for refinement.
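To make the "fuse retrieved anchors into a control vector via gated attention" step concrete, here is a minimal sketch in plain Python. It is not the paper's Synthesis Agent: the function name `fuse_anchors`, the scaled dot-product scoring, the scalar sigmoid gate, and the `gate_bias` parameter are all assumptions chosen for illustration, since the page gives no definition of the fusion.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def fuse_anchors(intent, anchors, gate_bias=0.0):
    """Blend retrieved anchors into one control vector (hypothetical sketch).

    intent  : text-derived intent vector (stand-in for the abstract intent)
    anchors : acoustic prototype vectors returned by a retrieval step
    """
    # Attention weights: scaled dot product between the intent and each anchor.
    scale = math.sqrt(len(intent))
    weights = softmax([dot(intent, a) / scale for a in anchors])
    # Attention-pooled anchor vector.
    pooled = [sum(w * a[i] for w, a in zip(weights, anchors))
              for i in range(len(intent))]
    # Scalar gate in (0, 1): how much to trust the pooled anchor over the raw intent.
    gate = 1.0 / (1.0 + math.exp(-(dot(intent, pooled) + gate_bias)))
    control = [gate * p + (1.0 - gate) * t for p, t in zip(pooled, intent)]
    return control, weights, gate

intent = [1.0, 0.0]
anchors = [[0.9, 0.1], [0.0, 1.0]]
control, weights, gate = fuse_anchors(intent, anchors)
```

In this toy setup the anchor closest to the intent receives the larger attention weight, and the gate interpolates between the pooled anchor and the raw intent. A learned gate would replace the fixed sigmoid used here.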

If this is right

  • Yields consistent and significant improvements over baselines on a composite-instruction benchmark and public test sets.
  • Reduces speaker-emotion leakage through regularization in the disentanglement stage.
  • Improves grounding of abstract intents by fusing selected anchors from the acoustic prototype library.
  • Resolves semantic-acoustic mismatches via the fast-slow feedback mechanism.
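The fast half of the feedback mechanism is described only as "latent gradient correction" of output intensity. One way to picture it, as a sketch under stated assumptions: treat intensity as a linear probe on the latent and take a few gradient steps on the squared intensity error. The linear probe, the names `fast_loop`, `probe`, and `target`, and the default step count and step size are illustrative choices, not details confirmed by this page.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def fast_loop(z, probe, target, steps=2, step_size=5e-3):
    """Nudge latent z so a linear intensity probe moves toward target.

    Hypothetical stand-in: intensity(z) = probe . z, loss = (intensity - target)**2.
    """
    z = list(z)
    for _ in range(steps):
        error = dot(probe, z) - target            # signed intensity error
        grad = [2.0 * error * w for w in probe]   # gradient of error**2 w.r.t. z
        z = [zi - step_size * g for zi, g in zip(z, grad)]
    return z

probe = [1.0, 2.0, 0.0]
z0 = [0.1, 0.1, 0.5]
target = 1.0
z1 = fast_loop(z0, probe, target)
before = abs(dot(probe, z0) - target)
after = abs(dot(probe, z1) - target)
```

With a small step size each step shrinks the error by a constant factor, so a handful of steps gives a cheap partial correction. Latent dimensions the probe ignores (the third one here) are left untouched.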

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The closed-loop structure could support iterative refinement in live applications where a user corrects the output mid-generation.
  • Scaling the prototype library with more diverse recordings might extend coverage to rarer composite intents not seen in training.
  • Similar agent decomposition might apply to other conditional generation tasks such as controllable video synthesis from mixed instructions.

Load-bearing premise

The framework assumes an adversarial disentanglement agent can reliably learn separable identity and emotion-prosody subspaces without residual leakage and that the acoustic prototype library provides sufficient coverage for grounding arbitrary composite intents.
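The coverage half of this premise can be checked with a simple statistic: for a set of intent queries, how often does the nearest prototype clear a similarity threshold? The sketch below assumes cosine similarity, a threshold of 0.8, and toy two-dimensional vectors. None of these come from the paper; they only illustrate the kind of measurement involved.

```python
import math

def cosine(a, b):
    # Assumes both vectors are non-zero.
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return sum(x * y for x, y in zip(a, b)) / (na * nb)

def coverage(queries, prototypes, threshold=0.8):
    """Fraction of queries whose nearest prototype clears the similarity threshold."""
    best = [max(cosine(q, p) for p in prototypes) for q in queries]
    covered = sum(1 for s in best if s >= threshold)
    return covered / len(queries), best

prototypes = [[1.0, 0.0], [0.0, 1.0]]             # toy library with two prototypes
queries = [[0.9, 0.1], [0.7, 0.7], [0.1, 0.95]]   # the middle query is a "composite"
rate, best = coverage(queries, prototypes)
```

In this toy case the composite query sits between the two prototypes and falls below the threshold. That is the failure mode a sparse library would show for rare composite intents.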

What would settle it

Run the composite-instruction benchmark with outputs evaluated for measurable leakage between speaker identity and emotion-prosody features; if leakage remains high or scores show no consistent gain over baselines, the central claim does not hold.
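One common way to make "measurable leakage" operational is a probe: try to predict the speaker from the emotion-prosody embedding alone and compare accuracy to chance. The sketch below uses a nearest-centroid probe on toy data. The probe type, the function names, and the data are assumptions for illustration, not the paper's evaluation protocol.

```python
def centroid(vectors):
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def probe_accuracy(train, test):
    """train/test: lists of (emotion_embedding, speaker_id) pairs.

    Returns (held-out accuracy, chance level for a uniform guess).
    """
    by_speaker = {}
    for emb, spk in train:
        by_speaker.setdefault(spk, []).append(emb)
    centroids = {spk: centroid(embs) for spk, embs in by_speaker.items()}
    hits = 0
    for emb, spk in test:
        guess = min(centroids, key=lambda s: sq_dist(emb, centroids[s]))
        hits += guess == spk
    return hits / len(test), 1.0 / len(centroids)

# Toy data: speaker identity is readable from the first coordinate (a leaky space).
leaky_train = [([0.0, 0.2], "A"), ([0.1, 0.8], "A"),
               ([1.0, 0.3], "B"), ([0.9, 0.7], "B")]
leaky_test = [([0.05, 0.5], "A"), ([0.95, 0.5], "B")]
acc, chance = probe_accuracy(leaky_train, leaky_test)
```

Probe accuracy well above chance indicates that speaker information is still recoverable from the emotion space. Accuracy near chance is consistent with low leakage but does not prove it, because a stronger probe might still succeed.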

Figures

Figures reproduced from arXiv: 2605.17583 by Bin Kang, Junjie Wang, Junle Wang, Junzhi Zhao, Shaoguo Wen, Shunlong Wu, Yang Fan, Yulin Li, Zhuotao Tian.

Figure 1: Illustration of Semantic-Acoustic Misalignment. The red dashed line delineates the target emotion, while the colored regions represent the generated speech. Despite high-level control instructions, the model exhibits under-expression along the target emotion dimension, while unintentionally leaking acoustic energy into irrelevant emotional dimensions.
Figure 2: Emotion expression bias. Target emotion dimensions are suppressed, while non-target dimensions exhibit consistent positive leakage across conditions.
Figure 3: Speaker–emotion entanglement: Composite-intent fidelity (C-SIM) is negatively correlated with the speaker-fidelity proxy (S-SIM).
Figure 4: Overview of the AGENTSTEERTTS architecture. (a) an Adversarial Disentanglement Module that utilizes gradient reversal layers to construct orthogonal speaker and emotion subspaces; (b) a Dual-Stream Anchoring Controller that aligns latent features with retrieved acoustic prototypes via consistency calibration; and (c) a Fast-Slow Feedback Loop that dynamically refines the generation process during inference…
Figure 5: Speaker–emotion entanglement before/after ADM disentanglement. ADM reduces speaker drift while maintaining emotion alignment, mitigating the identity–emotion trade-off.
Figure 6: Sensitivity of confidence-gated fusion β(δ).
Figure 7: Fast Agent efficiency–effectiveness trade-off.
Figure 8: Spectral and prosodic evidence of compositional control. Under “Happy but slightly Arrogant”, the generated speech retains Happy’s high-frequency energy and Arrogant’s restrained resonance, with matched F0/energy contours.
Figure 9: Composite-instruction radar plots in the 6-D emotion space. We overlay the target composite vector with the mean extracted emotion vector from multiple generations, highlighting target-dimension suppression and non-target leakage under discrete prompting.
Figure 10: Composite feasible regions across six composite emotion instructions. Compared with Text-only (neutral collapse) and Retrieval-only (scattered modes), our full system shifts and concentrates the output distribution into the target region, consistent with reduced collapse under composite control.
Figure 11: Speaker–emotion entanglement before/after ADM disentanglement on 500 labeled utterances. We show a four-panel t-SNE layout: z_id colored by speaker/emotion and z_emo colored by emotion/speaker. After ADM, speaker-driven separation in z_emo weakens while emotion structure remains more salient, consistent with reduced identity leakage into the emotion space.
Figure 12: Attribute energy allocation under a composite instruction (“Happy but slightly Arrogant”). Compared to the baseline, AGENTSTEERTTS concentrates energy on the target dimensions (higher target mass, lower leakage) and improves temporal stability.
Figure 13: Baseline vs. AGENTSTEERTTS under Happy+Arrogant: our method produces more localized, intent-aligned spectral and prosodic changes, while the baseline shows diluted or inconsistent edits.
Panels: (a) Baseline (w/o Disentanglement); (b) Ours (with ADM Disentanglement). Axes: Time (s), Frequency (Hz).
Figure 14: Baseline vs. AGENTSTEERTTS under Sad+Hopeful: our disentangled control reduces attribute leakage and yields more stable prosody under composition.
Figure 15: Baseline vs. AGENTSTEERTTS under Angry+Restrained: our method preserves speaker-related structure while applying targeted emotion-relevant modifications.
Panels: (a) Baseline (w/o Disentanglement); (b) Ours (with ADM Disentanglement). Axes: Time (s), Frequency (Hz).
Figure 16: Baseline vs. AGENTSTEERTTS under Surprised+Fearful: our method improves compositional fidelity by avoiding over-smoothed or globally perturbed patterns.
Figure 17: Compositional control on Angry+Restrained; localized spectral edits and consistent prosody are confirmed by (g) and composability scores in (h).
Panels include: (a) Base (Neutral); (b) Single: Angry; (c) Single: Calm. Axes: Time (s), Frequency (Hz).
Figure 18: Compositional control on Angry+Restrained; localized spectral edits and consistent prosody are confirmed by (g) and composability scores in (h).
Figure 19: Generalization on Surprised+Fearful: the composite preserves non-trivial spectral/prosodic cues, supported by localized edits in (g) and scores in (h).
Original abstract

While existing text-to-speech (TTS) models exhibit high expressiveness, fine-grained control over composite instructions remains challenging due to the structural mismatch between discrete textual intents and continuous acoustic realizations. Inspired by human cognitive decoupling, we introduce AgentSteerTTS, a multi-agent closed-loop framework designed for intent-faithful expressive control of composite instructions. First, in our framework, an adversarial disentanglement agent mitigates speaker-emotion leakage by learning separable identity and emotion-prosody subspaces with leakage-suppressing regularization. Next, a Dual-Stream Anchoring Controller grounds abstract intents using a large-scale acoustic prototype library: a Retrieval Agent selects expressive anchors, while a Synthesis Agent fuses them into continuous control vectors via gated attention. Finally, a Fast-Slow Feedback Agent refines output intensity through latent gradient correction and resolves semantic-acoustic mismatches using high-level perceptual critique. Experiments on a composite-instruction benchmark and public test sets show that AgentSteerTTS yields consistent and significant improvements to the baselines, demonstrating the effectiveness of the proposed method.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces AgentSteerTTS, a multi-agent closed-loop framework for composite-instruction text-to-speech synthesis. It first employs an adversarial disentanglement agent to learn separable identity and emotion-prosody subspaces via leakage-suppressing regularization. A Dual-Stream Anchoring Controller then grounds abstract intents by retrieving expressive anchors from a large-scale acoustic prototype library and fusing them into control vectors with gated attention. A Fast-Slow Feedback Agent refines intensity via latent gradient correction and resolves mismatches with perceptual critique. Experiments on a composite-instruction benchmark and public test sets are reported to yield consistent and significant improvements over baselines.

Significance. If the disentanglement produces cleanly separable subspaces and the prototype library provides sufficient coverage, the framework could meaningfully advance fine-grained, intent-faithful control in expressive TTS, addressing the discrete-to-continuous mismatch that limits current models. The closed-loop multi-agent structure and explicit anchoring mechanism represent a distinct architectural choice that, if validated, may generalize to other conditional generation tasks.

major comments (2)
  1. [Abstract and §3.1 (adversarial disentanglement agent)] The central claim of consistent improvements rests on the adversarial disentanglement agent producing cleanly separable subspaces. The abstract and method description mention leakage-suppressing regularization, yet no quantitative verification (e.g., correlation coefficients, mutual information, or orthogonality metrics between identity and emotion-prosody embeddings) is provided to confirm residual leakage is negligible in the regimes required for composite instructions. If leakage persists, the Dual-Stream Anchoring Controller cannot ground intents independently, undermining the reported gains.
  2. [§4 (Experiments)] The experimental results assert 'consistent and significant improvements' on a composite-instruction benchmark and public test sets, but the manuscript provides no tabulated metrics, baseline comparisons, error bars, or statistical tests in the visible description. This absence makes it impossible to evaluate effect sizes or rule out post-hoc selection effects.
minor comments (2)
  1. [§3.2] Clarify the precise definition and training objective of the gated attention fusion in the Synthesis Agent, including any hyper-parameters that control the balance between retrieved anchors.
  2. [§3.2] Provide details on the scale, construction, and coverage statistics of the large-scale acoustic prototype library to substantiate the claim that it supports arbitrary composite intents.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We address each major comment below, indicating the specific revisions we will make to strengthen the manuscript.

Point-by-point responses
  1. Referee: [Abstract and §3.1 (adversarial disentanglement agent)] The central claim of consistent improvements rests on the adversarial disentanglement agent producing cleanly separable subspaces. The abstract and method description mention leakage-suppressing regularization, yet no quantitative verification (e.g., correlation coefficients, mutual information, or orthogonality metrics between identity and emotion-prosody embeddings) is provided to confirm residual leakage is negligible in the regimes required for composite instructions. If leakage persists, the Dual-Stream Anchoring Controller cannot ground intents independently, undermining the reported gains.

    Authors: We agree that quantitative verification of subspace separability is essential to support the claims. In the revised manuscript we will add explicit metrics, including Pearson correlation coefficients, mutual information estimates, and orthogonality measures (e.g., cosine similarity or Gram-matrix off-diagonal norms) computed between the identity and emotion-prosody embeddings on both training and held-out composite-instruction data. These additions will directly demonstrate that residual leakage is negligible under the conditions used for the reported experiments. revision: yes
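As a sketch of what the promised correlation and orthogonality metrics could look like, the snippet below computes the Pearson cross-correlation matrix between identity and emotion-prosody embedding dimensions and summarizes it with a Frobenius norm. The function names and toy embeddings are hypothetical. Mutual information estimation is omitted because it needs an estimator choice the rebuttal does not specify.

```python
import math

def pearson(x, y):
    # Assumes neither column is constant (non-zero variance).
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)

def cross_correlation(z_id, z_emo):
    """Rows are utterances. Returns the (dims_id x dims_emo) Pearson matrix."""
    id_cols = list(zip(*z_id))
    emo_cols = list(zip(*z_emo))
    return [[pearson(a, b) for b in emo_cols] for a in id_cols]

def frobenius(matrix):
    return math.sqrt(sum(v * v for row in matrix for v in row))

# Toy embeddings: the first emotion dimension is an exact multiple of the
# first identity dimension, so it leaks identity by construction.
z_id = [[1.0, 0.0], [2.0, 1.0], [3.0, 0.0], [4.0, 1.0]]
z_emo = [[2.0, 1.0], [4.0, -1.0], [6.0, -1.0], [8.0, 1.0]]
corr = cross_correlation(z_id, z_emo)
leak = frobenius(corr)
```

A matrix close to zero everywhere (small Frobenius norm) would support the separability claim for linear dependence. Nonlinear leakage would need a different measure, such as the mutual information estimates the authors mention.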

  2. Referee: [§4 (Experiments)] The experimental results assert 'consistent and significant improvements' on a composite-instruction benchmark and public test sets, but the manuscript provides no tabulated metrics, baseline comparisons, error bars, or statistical tests in the visible description. This absence makes it impossible to evaluate effect sizes or rule out post-hoc selection effects.

    Authors: We acknowledge that the current presentation lacks sufficient tabular detail for independent evaluation. In the revised version we will include comprehensive result tables that report all objective and subjective metrics for AgentSteerTTS and every baseline, together with standard deviations or confidence intervals, and p-values from appropriate statistical tests (e.g., paired t-tests or Wilcoxon signed-rank tests with multiple-comparison correction). This will allow readers to assess effect sizes and rule out selection bias. revision: yes
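The statistical procedure described here can be sketched without external libraries. Note the substitution: instead of the paired t-test or Wilcoxon signed-rank test named above, this example uses an exact paired sign-flip permutation test, which needs no distribution tables, followed by Holm-Bonferroni correction. The scores are made up for illustration and are not results from the paper.

```python
from itertools import product

def paired_permutation_p(ours, baseline):
    """Exact two-sided sign-flip test on paired score differences.

    Enumerates all 2**n sign assignments, so it is only practical for small n.
    """
    diffs = [a - b for a, b in zip(ours, baseline)]
    observed = abs(sum(diffs))
    extreme = total = 0
    for signs in product((1, -1), repeat=len(diffs)):
        stat = abs(sum(s * d for s, d in zip(signs, diffs)))
        extreme += stat >= observed - 1e-12   # small tolerance for float ties
        total += 1
    return extreme / total

def holm(p_values):
    """Holm-Bonferroni adjusted p-values, returned in the input order."""
    order = sorted(range(len(p_values)), key=lambda i: p_values[i])
    adjusted = [0.0] * len(p_values)
    running = 0.0
    for rank, i in enumerate(order):
        running = max(running, (len(p_values) - rank) * p_values[i])
        adjusted[i] = min(1.0, running)
    return adjusted

ours = [0.82, 0.79, 0.85, 0.81, 0.88, 0.84]   # made-up per-utterance scores
base = [0.75, 0.74, 0.80, 0.79, 0.81, 0.80]
p = paired_permutation_p(ours, base)
adj = holm([p, 0.04, 0.30])                   # p plus two hypothetical other tests
```

With six pairs that all favor one system, the smallest attainable two-sided p-value is 2/64. After Holm correction across three comparisons it no longer clears 0.05, which shows why reporting corrected values and sample sizes matters for a claim of "significant" gains.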

Circularity Check

0 steps flagged

No circularity: empirical validation of multi-agent TTS framework


The paper describes an architectural framework consisting of an adversarial disentanglement agent, Dual-Stream Anchoring Controller with retrieval and synthesis agents, and Fast-Slow Feedback Agent. Effectiveness is asserted solely through experimental improvements on a composite-instruction benchmark and public test sets, with no mathematical derivation chain, fitted parameters renamed as predictions, or load-bearing self-citations that reduce the central claim to its own inputs by construction. The description remains self-contained against external benchmarks and does not invoke uniqueness theorems or ansatzes from prior self-work in a circular manner.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no explicit free parameters, axioms, or invented entities; the framework implicitly assumes separability of acoustic subspaces and coverage of a prototype library, but these are not formalized.

pith-pipeline@v0.9.0 · 5742 in / 1214 out tokens · 30635 ms · 2026-05-20T21:11:14.973844+00:00 · methodology


Reference graph

Works this paper leans on

110 extracted references · 110 canonical work pages · 10 internal anchors

  1. [1]

    Chen, Zhe, Wu, Jiannan, Wang, Wenhai, Su, Weijie, Chen, Guo, Xing, Sen, Zhong, Muyan, and others. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 2024

  2. [2]

    Mini-Gemini: Mining the Potential of Multi-Modality Vision Language Models

    Li, Yanwei, Zhang, Yuechen, Wang, Chengyao, Zhong, Zhisheng, Chen, Yixin, Chu, Ruihang, Liu, Shaoteng, and Jia, Jiaya. Mini-Gemini: Mining the Potential of Multi-Modality Vision Language Models.

  3. [3]

    Li, Junnan, Li, Dongxu, Savarese, Silvio, and Hoi, Steven. 2023

  4. [4]

    Kang, Bin, Chen, Bin, Wang, Junjie, Li, Yulin, Zhao, Junzhi, Wang, Junle, and Tian, Zhuotao. 2025

  5. [5]

    Zhang, Dong, Li, Shimin, Zhang, Xin, Zhan, Jun, Wang, Pengyu, Zhou, Yaqian, and Qiu, Xipeng. Findings of the Association for Computational Linguistics: EMNLP 2023. 2023

  6. [8]

    Cui, Wenqian, Yu, Dianzhi, Jiao, Xiaoqi, Meng, Ziqiao, Zhang, Guangyan, Wang, Qichao, Guo, Steven Y., King, Irwin, and others. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2025

  7. [9]

    Ju, Zeqian, Wang, Yuancheng, Shen, Kai, He, Lei, Tan, Xu, Liu, Eric, Leng, Yichong, Zhao, Sheng, Qin, Tao, and Bian, Jiang. Proceedings of the International Conference on Machine Learning. 2024

  8. [10]

    TTSlow: Slow Down Text-to-Speech With Efficiency Robustness Evaluations

    Gao, Xiaoxue, Chen, Yiming, Yue, Xianghu, Tsao, Yu, and Chen, Nancy F. TTSlow: Slow Down Text-to-Speech With Efficiency Robustness Evaluations.

  9. [11]

    Yang, Guanrou, Yang, Chen, Chen, Qian, Ma, Ziyang, Chen, Wenxi, Wang, Wen, Wang, Tianrui, and others. 2025

  10. [12]

    Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers

    Chen, Sanyuan, Wang, Chengyi, Wu, Yu, Zhang, Ziqiang, Zhou, Long, Liu, Shujie, Chen, Zhuo, Liu, Yanqing, and others. Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers.

  11. [13]

    InstructTTSEval: Benchmarking Complex Natural-Language Instruction Following in Text-to-Speech Systems

    InstructTTSEval: Benchmarking Complex Natural-Language Instruction Following in Text-to-Speech Systems. Preprint. 2025

  12. [14]

    IndexTTS: An Industrial-Level Controllable and Efficient Zero-Shot Text-To-Speech System

    IndexTTS: An Industrial-Level Controllable and Efficient Zero-Shot Text-To-Speech System. Preprint. 2025

  13. [15]

    Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech

    Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech. Proceedings of the 38th International Conference on Machine Learning. 2021

  14. [16]

    Voicebox: Text-Guided Multilingual Universal Speech Generation at Scale

    Le, Matthew, Vyas, Apoorv, Shi, Bowen, Karrer, Brian, Sari, Leda, Moritz, Rashel, and others. Voicebox: Text-Guided Multilingual Universal Speech Generation at Scale.

  15. [17]

    StyleTTS: A Style-Based Generative Model for Natural and Diverse Text-to-Speech Synthesis

    Li, Yinghao Aaron, Han, Cong, and Mesgarani, Nima. StyleTTS: A Style-Based Generative Model for Natural and Diverse Text-to-Speech Synthesis.

  16. [18]

    Diffsody: Disentangling Speaker-Invariant Prosody Representations via Diffusion Probabilistic Models

    Diffsody: Disentangling Speaker-Invariant Prosody Representations via Diffusion Probabilistic Models. IEEE Transactions on Neural Networks and Learning Systems.

  17. [19]

    DiEmo-TTS: Disentangled Emotion Representations via Self-Supervised Distillation for Cross-Speaker Emotion Transfer in Text-to-Speech

    DiEmo-TTS: Disentangled Emotion Representations via Self-Supervised Distillation for Cross-Speaker Emotion Transfer in Text-to-Speech. Proceedings of Interspeech.

  18. [20]

    Self-Refine: Iterative Refinement with Self-Feedback

    Madaan, Aman, Tandon, Niket, Gupta, Prakhar, Hallinan, Skyler, Gao, Luyu, Wiegreffe, Sarah, Alon, Uri, and others. Self-Refine: Iterative Refinement with Self-Feedback.

  19. [21]

    Seen and Unseen Emotional Style Transfer for Voice Conversion with A New Emotional Speech Dataset

    Zhou, Kun, Sisman, Berrak, Liu, Rui, and Li, Haizhou. Seen and Unseen Emotional Style Transfer for Voice Conversion with A New Emotional Speech Dataset.

  20. [22]

    The MSP-Podcast Corpus

    The MSP-Podcast Corpus. Preprint. 2025

  21. [23]

    emotion2vec: Self-Supervised Pre-Training for Speech Emotion Representation

    emotion2vec: Self-Supervised Pre-Training for Speech Emotion Representation. Findings of the Association for Computational Linguistics: ACL 2024.

  22. [24]

    F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching

    F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics.

  23. [25]

    CosyVoice: A Scalable Multilingual Zero-shot Text-to-speech Synthesizer based on Supervised Semantic Tokens

    CosyVoice: A Scalable Multilingual Zero-shot Text-to-speech Synthesizer based on Supervised Semantic Tokens. Preprint. 2024

  24. [26]

    CosyVoice 2: Scalable Streaming Speech Synthesis with Large Language Models

    CosyVoice 2: Scalable Streaming Speech Synthesis with Large Language Models. Preprint. 2024

  25. [27]

    Spark-TTS: An Efficient LLM-Based Text-to-Speech Model with Single-Stream Decoupled Speech Tokens

    Spark-TTS: An Efficient LLM-Based Text-to-Speech Model with Single-Stream Decoupled Speech Tokens. Preprint. 2025

  26. [28]

    IndexTTS2: A Breakthrough in Emotionally Expressive and Duration-Controlled Auto-Regressive Zero-Shot Text-to-Speech

    IndexTTS2: A Breakthrough in Emotionally Expressive and Duration-Controlled Auto-Regressive Zero-Shot Text-to-Speech. Preprint. 2025

  27. [29]

    ReAct: Synergizing Reasoning and Acting in Language Models

    ReAct: Synergizing Reasoning and Acting in Language Models. International Conference on Learning Representations (ICLR).

  28. [30]

    Multi-agent reinforcement learning for resources allocation optimization: a survey

    Multi-agent reinforcement learning for resources allocation optimization: a survey. Artificial Intelligence Review.

  29. [31]

    DoF: A Diffusion Factorization Framework for Offline Multi-Agent Reinforcement Learning

    DoF: A Diffusion Factorization Framework for Offline Multi-Agent Reinforcement Learning. Proceedings of the Thirteenth International Conference on Learning Representations.

  30. [32]

    Zhou, Zewei, Zhao, Seth Z., Cai, Tianhui, Huang, Zhiyu, Zhou, Bolei, and Ma, Jiaqi. Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). 2025

  31. [33]

    Yang, Guanrou, Yang, Chen, Chen, Qian, Ma, Ziyang, Chen, Wenxi, Wang, Wen, Wang, Tianrui, Yang, Yifan, and others. 2025

  32. [34]

    AlphaAgents: Large Language Model based Multi-Agents for Equity Portfolio Constructions

    AlphaAgents: Large Language Model based Multi-Agents for Equity Portfolio Constructions. Preprint. 2025

  33. [35]

    EMOVA: Empowering Language Models to See, Hear and Speak with Vivid Emotions

    EMOVA: Empowering Language Models to See, Hear and Speak with Vivid Emotions. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 2025

  34. [37]

    LLMVoX: Autoregressive Streaming Text-to-Speech Model for Any LLM

    LLMVoX: Autoregressive Streaming Text-to-Speech Model for Any LLM. Findings of the Association for Computational Linguistics: ACL 2025.

  35. [38]

    Emo-DPO: Controllable Emotional Speech Synthesis through Direct Preference Optimization

    Gao, Xiaoxue, Zhang, Chen, Chen, Yiming, Zhang, Huayun, and Chen, Nancy F. Emo-DPO: Controllable Emotional Speech Synthesis through Direct Preference Optimization.

  36. [39]

    Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers

    Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers. Preprint. 2023

  37. [40]

    MM-TTS: A Unified Framework for Multi-Modal Prompt-Based Emotional Text-to-Speech

    MM-TTS: A Unified Framework for Multi-Modal Prompt-Based Emotional Text-to-Speech. Preprint. 2024

  38. [41]

    Qwen2-Audio Technical Report

    Qwen2-Audio Technical Report. Preprint. 2024

  39. [42]

    Gemini 1.5: Unlocking Multimodal Understanding Across Millions of Tokens of Context

    Gemini 1.5: Unlocking Multimodal Understanding Across Millions of Tokens of Context. Preprint. 2024

  40. [43]

    GPT-4o System Card

    GPT-4o System Card. 2024

  41. [44]

    Qwen3 Technical Report

    Qwen3 Technical Report. Preprint. 2025

  42. [45]

    Guiding FastSpeech2 Towards Emotional Text-to-Speech

    Guiding FastSpeech2 Towards Emotional Text-to-Speech. Preprint. 2023

  43. [46]

    An Emotion Speech Synthesis Method Based on VITS

    An Emotion Speech Synthesis Method Based on VITS. Applied Sciences.

  44. [47]

    GenerSpeech: Towards Style Transfer for Generalizable Out-of-Domain Text-to-Speech

    GenerSpeech: Towards Style Transfer for Generalizable Out-of-Domain Text-to-Speech. Advances in Neural Information Processing Systems (NeurIPS).

  45. [48]

    An Argument for Basic Emotions

    An Argument for Basic Emotions. Cognition & Emotion.

  46. [50]

    CAMEL: Communicative Agents for "Mind" Exploration of Large Language Model Society

    CAMEL: Communicative Agents for "Mind" Exploration of Large Language Model Society. Advances in Neural Information Processing Systems (NeurIPS).

  47. [52]

    DialogueAgents: A Hybrid Agent-Based Speech Synthesis Framework for Multi-Party Dialogue

    DialogueAgents: A Hybrid Agent-Based Speech Synthesis Framework for Multi-Party Dialogue. IEEE International Conference on Multimedia & Expo (ICME).

  48. [54]

    LongHorizon

    Kang, Bin, Wen, Shaoguo, Bi, Yifei, Wu, Shunlong, Yuan, Xinbin, Shao, Rui, Wang, Junle, and Tian, Zhuotao. LongHorizon

  49. [55]

    Less is More, But Where? Dynamic Token Compression via

    Li, Yulin, Gui, Haokun, Fan, Ziyang, Wang, Junjie, Kang, Bin, Chen, Bin, and Tian, Zhuotao. Less is More, But Where? Dynamic Token Compression via

  50. [56]

    Efficient Reasoning with Balanced Thinking

    Efficient Reasoning with Balanced Thinking. Proceedings of the 14th International Conference on Learning Representations.

  51. [57]

    Wang, Junjie, Chen, Bin, Li, Yulin, Kang, Bin, Chen, Yichi, and Tian, Zhuotao.

  52. [58]

    Wang, Junjie, Chen, Bin, Kang, Bin, Li, Yulin, Xian, Weizhi, Chen, Yichi, and Xu, Yong.

  53. [60]

    Mini-Gemini: Mining the Potential of Multi-Modality Vision Language Models

    Li, Yanwei, Zhang, Yuechen, Wang, Chengyao, Zhong, Zhisheng, Chen, Yixin, Chu, Ruihang, Liu, Shaoteng, and Jia, Jiaya. Mini-Gemini: Mining the Potential of Multi-Modality Vision Language Models. IEEE Transactions on Pattern Analysis and Machine Intelligence. pages 1-14. 2025

  54. [61]

    BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models

    Li, Junnan, Li, Dongxu, Savarese, Silvio, and Hoi, Steven. BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models. Proceedings of the 40th International Conference on Machine Learning. Vol. 202, pages 19730--19742. 2023

  55. [62]

    CalibCLIP: Contextual Calibration of Dominant Semantics for Text-Driven Image Retrieval

    Kang, Bin, Chen, Bin, Wang, Junjie, Li, Yulin, Zhao, Junzhi, Wang, Junle, and Tian, Zhuotao. CalibCLIP: Contextual Calibration of Dominant Semantics for Text-Driven Image Retrieval. Proceedings of the 33rd ACM International Conference on Multimedia. pages 5140--5149. 2025

  56. [63]

    Less is More, But Where? Dynamic Token Compression via LLM-Guided Keyframe Prior

    Li, Yulin, Gui, Haokun, Fan, Ziyang, Wang, Junjie, Kang, Bin, Chen, Bin, and Tian, Zhuotao. Less is More, But Where? Dynamic Token Compression via LLM-Guided Keyframe Prior. Advances in Neural Information Processing Systems. 38, pages 156861--156904. 2026

  57. [64]

    Efficient Reasoning with Balanced Thinking

    Li, Yulin, Tu, Tengyao, Ding, Li, Wang, Junjie, Zhen, Huiling, Chen, Yixin, Yong, Li, and Tian, Zhuotao. Efficient Reasoning with Balanced Thinking. Proceedings of the 14th International Conference on Learning Representations. 2026

  58. [65]

    SpeechGPT: Empowering Large Language Models with Intrinsic Cross-Modal Conversational Abilities

    Zhang, Dong, Li, Shimin, Zhang, Xin, Zhan, Jun, Wang, Pengyu, Zhou, Yaqian, and Qiu, Xipeng. SpeechGPT: Empowering Large Language Models with Intrinsic Cross-Modal Conversational Abilities. Findings of the Association for Computational Linguistics: EMNLP 2023. pages 15757-15773. 2023

  59. [66]

    AudioPaLM: A Large Language Model That Can Speak and Listen

    Rubenstein, Paul K., Asawaroengchai, Chulayuth, Nguyen, Duc Dung, Bapna, Ankur, Borsos, Zalán, and others. AudioPaLM: A Large Language Model That Can Speak and Listen. arXiv preprint arXiv:2306.12925. 2023

  60. [68]

    Recent Advances in Speech Language Models: A Survey

    Cui, Wenqian, Yu, Dianzhi, Jiao, Xiaoqi, Meng, Ziqiao, Zhang, Guangyan, Wang, Qichao, Guo, Steven Y., King, Irwin, and others. Recent Advances in Speech Language Models: A Survey. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). pages 13943--13970. 2025

  61. [69]

    NaturalSpeech 3: Zero-Shot Speech Synthesis with Factorized Codec and Diffusion Models

    Ju, Zeqian, Wang, Yuancheng, Shen, Kai, He, Lei, Tan, Xu, Liu, Eric, Leng, Yichong, Zhao, Sheng, Qin, Tao, and Bian, Jiang. NaturalSpeech 3: Zero-Shot Speech Synthesis with Factorized Codec and Diffusion Models. Proceedings of the International Conference on Machine Learning. pages 59545--59570. 2024

  62. [70]

    TTSlow: Slow Down Text-to-Speech With Efficiency Robustness Evaluations

    Gao, Xiaoxue, Chen, Yiming, Yue, Xianghu, Tsao, Yu, and Chen, Nancy F. TTSlow: Slow Down Text-to-Speech With Efficiency Robustness Evaluations. IEEE Transactions on Audio, Speech and Language Processing. 33, pages 693-704. 2025

  63. [71]

    Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers

    Chen, Sanyuan, Wang, Chengyi, Wu, Yu, Zhang, Ziqiang, Zhou, Long, Liu, Shujie, Chen, Zhuo, Liu, Yanqing, and others. Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers. IEEE/ACM Transactions on Audio, Speech, and Language Processing. 33, pages 705-718. 2025

  64. [72]

    InstructTTSEval: Benchmarking Complex Natural-Language Instruction Following in Text-to-Speech Systems

    Kexin Huang, Qian Tu, Liwei Fan, Chenchen Yang, Dong Zhang, Shimin Li, Zhaoye Fei, Qinyuan Cheng, and Xipeng Qiu. InstructTTSEval: Benchmarking Complex Natural-Language Instruction Following in Text-to-Speech Systems. Manuscript. 2025. arXiv:2506.16381

  65. [73]

    EmoVoice: LLM-based Emotional Text-To-Speech Model with Freestyle Text Prompting

    Yang, Guanrou, Yang, Chen, Chen, Qian, Ma, Ziyang, Chen, Wenxi, Wang, Wen, Wang, Tianrui, and others. EmoVoice: LLM-based Emotional Text-To-Speech Model with Freestyle Text Prompting. Proceedings of the 33rd ACM International Conference on Multimedia. pages 10748--10757. 2025

  66. [74]

    IndexTTS2: A Breakthrough in Emotionally Expressive and Duration-Controlled Auto-Regressive Zero-Shot Text-to-Speech

    Zhou, Siyi, Zhou, Yiquan, He, Yi, Zhou, Xun, Wang, Jinchao, Deng, Wei, and Shu, Jingchen. IndexTTS2: A Breakthrough in Emotionally Expressive and Duration-Controlled Auto-Regressive Zero-Shot Text-to-Speech. Manuscript. 2025. arXiv:2506.21619

  67. [75]

    Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech

    Kim, Jaehyeon, Kong, Jungil, and Son, Juhee. Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech. Proceedings of the 38th International Conference on Machine Learning. 139, pages 5530--5540. 2021

  68. [76]

    Seen and Unseen Emotional Style Transfer for Voice Conversion with A New Emotional Speech Dataset

    Zhou, Kun, Sisman, Berrak, Liu, Rui, and Li, Haizhou. Seen and Unseen Emotional Style Transfer for Voice Conversion with A New Emotional Speech Dataset. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). pages 920-924. 2021

  69. [77]

    Voicebox: Text-Guided Multilingual Universal Speech Generation at Scale

    Le, Matthew, Vyas, Apoorv, Shi, Bowen, Karrer, Brian, Sari, Leda, Moritz, Rashel, and others. Voicebox: Text-Guided Multilingual Universal Speech Generation at Scale. Advances in Neural Information Processing Systems. 36, pages 14005--14034. 2023

  70. [78]

    Diffsody: Disentangling Speaker-Invariant Prosody Representations via Diffusion Probabilistic Models

    Qu, Leyuan, Zhang, Zhuo, Huang, Xiaoliang, Wen, Ping, Zhang, Rui, Wang, Wen, and others. Diffsody: Disentangling Speaker-Invariant Prosody Representations via Diffusion Probabilistic Models. IEEE Transactions on Neural Networks and Learning Systems. pages 1--12. 2025

  71. [79]

    DiEmo-TTS: Disentangled Emotion Representations via Self-Supervised Distillation for Cross-Speaker Emotion Transfer in Text-to-Speech

    Cho, Deok-Hyeon, Oh, Hyung-Seok, Kim, Seung-Bin, and Lee, Seong-Whan. DiEmo-TTS: Disentangled Emotion Representations via Self-Supervised Distillation for Cross-Speaker Emotion Transfer in Text-to-Speech. Proceedings of Interspeech. pages 4373--4377. 2025

  72. [80]

    An Argument for Basic Emotions

    Ekman, Paul. An Argument for Basic Emotions. Cognition & Emotion. 6(3--4), pages 169--200. 1992

  73. [81]

    An Emotion Speech Synthesis Method Based on VITS

    Zhao, Wei and Yang, Zheng. An Emotion Speech Synthesis Method Based on VITS. Applied Sciences. 13(4), pages 2225. 2023

  74. [82]

    Guiding FastSpeech2 Towards Emotional Text-to-Speech

    Ju, Zeqian and others. Guiding FastSpeech2 Towards Emotional Text-to-Speech. Manuscript. 2023. arXiv:2307.00024

  75. [83]

    GenerSpeech: Towards Style Transfer for Generalizable Out-of-Domain Text-to-Speech

    Zhao, Yue and others. GenerSpeech: Towards Style Transfer for Generalizable Out-of-Domain Text-to-Speech. Advances in Neural Information Processing Systems (NeurIPS). 2022

  76. [84]

    Self-Refine: Iterative Refinement with Self-Feedback

    Madaan, Aman, Tandon, Niket, Gupta, Prakhar, Hallinan, Skyler, Gao, Luyu, Wiegreffe, Sarah, Alon, Uri, and others. Self-Refine: Iterative Refinement with Self-Feedback. Advances in Neural Information Processing Systems. 36, pages 46534--46594. 2023

  77. [85]

    The MSP-Podcast Corpus

    Carlos Busso, Reza Lotfian, Kusha Sridhar, Ali N. Salman, Wei-Cheng Lin, and others. The MSP-Podcast Corpus. Manuscript. 2025. arXiv:2509.09791

  78. [86]

    emotion2vec: Self-Supervised Pre-Training for Speech Emotion Representation

    Ma, Ziyue, Zheng, Zhisheng, Ye, Jiaxin, Li, Jinchao, Gao, Zhifu, Zhang, ShiLiang, and others. emotion2vec: Self-Supervised Pre-Training for Speech Emotion Representation. Findings of the Association for Computational Linguistics: ACL 2024. pages 15747--15760. 2024

  79. [87]

    F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching

    Chen, Yushen, Niu, Zhikang, Ma, Ziyang, Deng, Keqi, Wang, Chunhui, Zhao, Jian, Yu, Kai, and Chen, Xie. F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics. pages 6255--6271. 2025

  80. [88]

    CosyVoice: A Scalable Multilingual Zero-shot Text-to-speech Synthesizer based on Supervised Semantic Tokens

    Du, Zhihao, Chen, Qian, Zhang, Shiliang, Hu, Kai, Lu, Heng, Yang, Yexin, Hu, Hangrui, and others. CosyVoice: A Scalable Multilingual Zero-shot Text-to-speech Synthesizer based on Supervised Semantic Tokens. Manuscript. 2024. arXiv:2407.05407

Showing first 80 references.