AgentSteerTTS: A Multi-Agent Closed-Loop Framework for Composite-Instruction Text-to-Speech

Bin Kang; Junjie Wang; Junle Wang; Junzhi Zhao; Shaoguo Wen; Shunlong Wu; Yang Fan; Yulin Li; Zhuotao Tian

arxiv: 2605.17583 · v1 · pith:E625IQ2Wnew · submitted 2026-05-14 · 💻 cs.CV


Pith reviewed 2026-05-20 21:11 UTC · model grok-4.3

classification 💻 cs.CV

keywords text-to-speech · multi-agent framework · composite instructions · adversarial disentanglement · acoustic prototypes · closed-loop control · expressive speech synthesis · intent control


The pith

A multi-agent closed-loop framework separates speaker identity from emotion and anchors composite text intents to acoustic prototypes for more faithful TTS output.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces AgentSteerTTS to address the mismatch between discrete textual instructions and continuous speech features in text-to-speech systems. It deploys three cooperating agents in a feedback loop: one uses adversarial training to isolate speaker identity from emotion and prosody, another retrieves and blends examples from an acoustic library to match abstract intents, and a third corrects mismatches via gradient steps and perceptual review. A sympathetic reader would care because composite instructions, such as a specific emotion combined with particular speaker traits, often produce leakage or drift in current models. If the approach holds, generated speech would reflect detailed instructions more reliably, without manual parameter tuning.

Core claim

AgentSteerTTS is a multi-agent closed-loop framework for intent-faithful expressive control of composite instructions in TTS. An adversarial disentanglement agent mitigates speaker-emotion leakage by learning separable identity and emotion-prosody subspaces with leakage-suppressing regularization. A Dual-Stream Anchoring Controller grounds abstract intents using a large-scale acoustic prototype library, where a Retrieval Agent selects expressive anchors and a Synthesis Agent fuses them into continuous control vectors via gated attention. A Fast-Slow Feedback Agent refines output intensity through latent gradient correction and resolves semantic-acoustic mismatches using high-level perceptual critique.

What carries the argument

The multi-agent closed-loop framework that combines an adversarial disentanglement agent for subspace separation, a dual-stream anchoring controller with retrieval and synthesis agents over an acoustic prototype library, and a fast-slow feedback agent for refinement.
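The anchoring step can be pictured with a small sketch. It retrieves the top-k prototypes by cosine similarity, blends them with softmax attention weights, and mixes the blend with the text-derived vector through a scalar gate driven by retrieval confidence. Everything named here is an assumption made for illustration: the toy library, the functions `retrieve` and `fuse`, the temperature, and the sigmoid gate. The paper's gated attention is learned, and its exact form is not given in this summary.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, library, k=2):
    """Return the k (similarity, vector) pairs most similar to the query."""
    scored = [(cosine(query, vec), vec) for vec in library]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]

def fuse(text_vec, anchors, temperature=0.1, gate_bias=0.5, gate_slope=10.0):
    """Blend retrieved anchors and gate the blend against the text vector.

    Hypothetical stand-in for gated attention: a fixed softmax over
    similarities plus a sigmoid gate on the best similarity.
    """
    sims = [s for s, _ in anchors]
    best = max(sims)
    # Softmax over similarities (shifted by the max for numerical stability).
    exps = [math.exp((s - best) / temperature) for s in sims]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(text_vec)
    blended = [sum(w * vec[i] for w, (_, vec) in zip(weights, anchors))
               for i in range(dim)]
    # High retrieval confidence pushes the gate toward the anchors,
    # low confidence falls back to the text-derived vector.
    gate = 1.0 / (1.0 + math.exp(-gate_slope * (best - gate_bias)))
    control = [gate * b + (1.0 - gate) * t for b, t in zip(blended, text_vec)]
    return control, gate, weights
```

A fixed gate like this one only shows the fallback behaviour when the library lacks a close match; a trained gate would replace the hand-set `gate_bias` and `gate_slope`.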

If this is right

Yields consistent and significant improvements over baselines on a composite-instruction benchmark and public test sets.
Reduces speaker-emotion leakage through regularization in the disentanglement stage.
Improves grounding of abstract intents by fusing selected anchors from the acoustic prototype library.
Resolves semantic-acoustic mismatches via the fast-slow feedback mechanism.
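The "latent gradient correction" in the fast loop can be sketched as a few gradient steps that nudge a latent vector until a probe's emotion scores move toward the target. This is a minimal illustration under stated assumptions: the linear probe `W`, the squared-error objective, the step count, and the learning rate are all hypothetical, and the paper's actual objective and probe are not described in this summary.

```python
def predict(W, z):
    """Linear probe: map latent z to emotion scores (W is a list of rows)."""
    return [sum(w * x for w, x in zip(row, z)) for row in W]

def loss(W, z, target):
    """Squared error between predicted and target emotion scores."""
    return sum((p - t) ** 2 for p, t in zip(predict(W, z), target))

def fast_loop(W, z, target, steps=2, lr=0.1):
    """Take a few gradient-descent steps on the latent (assumed objective)."""
    z = list(z)
    for _ in range(steps):
        residual = [p - t for p, t in zip(predict(W, z), target)]
        # d/dz_j of sum_i residual_i^2 is 2 * sum_i residual_i * W[i][j].
        grad = [2.0 * sum(r * row[j] for r, row in zip(residual, W))
                for j in range(len(z))]
        z = [x - lr * g for x, g in zip(z, grad)]
    return z
```

Keeping the number of steps small is what makes this a "fast" correction; larger mismatches would be left to the slower perceptual critique.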

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The closed-loop structure could support iterative refinement in live applications where a user corrects the output mid-generation.
Scaling the prototype library with more diverse recordings might extend coverage to rarer composite intents not seen in training.
Similar agent decomposition might apply to other conditional generation tasks such as controllable video synthesis from mixed instructions.

Load-bearing premise

The framework assumes an adversarial disentanglement agent can reliably learn separable identity and emotion-prosody subspaces without residual leakage and that the acoustic prototype library provides sufficient coverage for grounding arbitrary composite intents.

What would settle it

Run the composite-instruction benchmark with outputs evaluated for measurable leakage between speaker identity and emotion-prosody features; if leakage remains high or scores show no consistent gain over baselines, the central claim does not hold.
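One simple way to make "measurable leakage" concrete is the mean absolute Pearson correlation between every identity dimension and every emotion-prosody dimension over a batch of utterances. The sketch below assumes the two embeddings are available as lists of vectors; the function names are hypothetical, and this is not the paper's metric. It captures only linear dependence, so a low score does not rule out nonlinear leakage, which would need a probe classifier or a mutual-information estimate.

```python
import math

def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if constant)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)

def linear_leakage(z_id, z_emo):
    """Mean |correlation| over all (identity dim, emotion dim) pairs.

    z_id and z_emo are lists of per-utterance vectors, aligned by index.
    0.0 means no linear dependence; 1.0 means perfectly correlated.
    """
    id_dims = list(zip(*z_id))    # transpose: one tuple per dimension
    emo_dims = list(zip(*z_emo))
    corrs = [abs(pearson(a, b)) for a in id_dims for b in emo_dims]
    return sum(corrs) / len(corrs)
```

Comparing this score before and after the disentanglement stage, on held-out composite-instruction data, would give a first-pass answer to whether the regularization is doing its job.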

Figures

Figures reproduced from arXiv: 2605.17583 by Bin Kang, Junjie Wang, Junle Wang, Junzhi Zhao, Shaoguo Wen, Shunlong Wu, Yang Fan, Yulin Li, Zhuotao Tian.

**Figure 1.** Illustration of Semantic-Acoustic Misalignment. The red dashed line delineates the target emotion, while the colored regions represent the generated speech. Despite high-level control instructions, the model exhibits under-expression along the target emotion dimension, while unintentionally leaking acoustic energy into irrelevant emotional dimensions.

**Figure 2.** Emotion expression bias. Target emotion dimensions are suppressed, while non-target dimensions exhibit consistent positive leakage across conditions.

**Figure 3.** Speaker–emotion entanglement: Composite-intent fidelity (C-SIM) is negatively correlated with the speaker-fidelity proxy (S-SIM).

**Figure 4.** Overview of the AGENTSTEERTTS architecture. (a) an Adversarial Disentanglement Module that utilizes gradient reversal layers to construct orthogonal speaker and emotion subspaces; (b) a Dual-Stream Anchoring Controller that aligns latent features with retrieved acoustic prototypes via consistency calibration; and (c) a Fast-Slow Feedback Loop that dynamically refines the generation process during inference…

**Figure 5.** Speaker–emotion entanglement before/after ADM disentanglement. ADM reduces speaker drift while maintaining emotion alignment, mitigating the identity–emotion trade-off.

**Figure 6.** Sensitivity of confidence-gated fusion β(δ). Panel title: Sensitivity to Retrieval Confidence.

**Figure 7.** Fast Agent efficiency–effectiveness trade-off. Panel title: Fast-Loop Efficiency–Effectiveness Trade-off.

**Figure 8.** Spectral and prosodic evidence of compositional control. Under “Happy but slightly Arrogant”, the composite output retains Happy’s high-frequency energy and Arrogant’s restrained resonance, with matched F0/energy contours.

**Figure 9.** Composite-instruction radar plots in the 6-D emotion space. We overlay the target composite vector with the mean extracted emotion vector from multiple generations, highlighting target-dimension suppression and non-target leakage under discrete prompting.

**Figure 10.** Composite feasible regions across six composite emotion instructions. Compared with Text-only (neutral collapse) and Retrieval-only (scattered modes), our full system shifts and concentrates the output distribution into the target region, consistent with reduced collapse under composite control.

**Figure 11.** Speaker–emotion entanglement before/after ADM disentanglement on 500 labeled utterances. We show a four-panel t-SNE layout: z_id colored by speaker/emotion and z_emo colored by emotion/speaker. After ADM, speaker-driven separation in z_emo weakens while emotion structure remains more salient, consistent with reduced identity leakage into the emotion space.

**Figure 12.** Attribute energy allocation under a composite instruction (“Happy but slightly Arrogant”). Compared to the baseline, AGENTSTEERTTS concentrates energy on the target dimensions (higher target mass, lower leakage) and improves temporal stability.

**Figure 13.** Baseline vs. AGENTSTEERTTS under Happy+Arrogant: our method produces more localized, intent-aligned spectral and prosodic changes, while the baseline shows diluted or inconsistent edits. Panels: (a) Baseline (w/o Disentanglement), (b) Ours (with ADM Disentanglement); axes: Time (s) vs. Frequency (Hz).

**Figure 14.** Baseline vs. AGENTSTEERTTS under Sad+Hopeful: our disentangled control reduces attribute leakage and yields more stable prosody under composition.

**Figure 15.** Baseline vs. AGENTSTEERTTS under Angry+Restrained: our method preserves speaker-related structure while applying targeted emotion-relevant modifications. Panels: (a) Baseline (w/o Disentanglement), (b) Ours (with ADM Disentanglement); axes: Time (s) vs. Frequency (Hz).

**Figure 16.** Baseline vs. AGENTSTEERTTS under Surprised+Fearful: our method improves compositional fidelity by avoiding oversmoothed or globally perturbed patterns.

**Figure 17.** Compositional control on Angry+Restrained; localized spectral edits and consistent prosody are confirmed by (g) and composability scores in (h). Panels include (a) Base (Neutral), (b) Single: Angry, (c) Single: Calm; axes: Time (s) vs. Frequency (Hz).

**Figure 18.** Compositional control on Angry+Restrained; localized spectral edits and consistent prosody are confirmed by (g) and composability scores in (h).

**Figure 19.** Generalization on Surprised+Fearful: the composite preserves non-trivial spectral/prosodic cues, supported by localized edits in (g) and scores in (h).

Original abstract

While existing text-to-speech (TTS) models exhibit high expressiveness, fine-grained control over composite instructions remains challenging due to the structural mismatch between discrete textual intents and continuous acoustic realizations. Inspired by human cognitive decoupling, we introduce AgentSteerTTS, a multi-agent closed-loop framework designed for intent-faithful expressive control of composite instructions. First, in our framework, an adversarial disentanglement agent mitigates speaker-emotion leakage by learning separable identity and emotion-prosody subspaces with leakage-suppressing regularization. Next, a Dual-Stream Anchoring Controller grounds abstract intents using a large-scale acoustic prototype library: a Retrieval Agent selects expressive anchors, while a Synthesis Agent fuses them into continuous control vectors via gated attention. Finally, a Fast-Slow Feedback Agent refines output intensity through latent gradient correction and resolves semantic-acoustic mismatches using high-level perceptual critique. Experiments on a composite-instruction benchmark and public test sets show that AgentSteerTTS yields consistent and significant improvements to the baselines, demonstrating the effectiveness of the proposed method.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

AgentSteerTTS outlines a multi-agent loop with disentanglement and prototype anchoring for composite TTS instructions, but the abstract leaves the claimed gains and leakage control unverified.


The main point is that this paper puts forward a multi-agent closed-loop system to handle composite instructions in TTS by separating speaker identity from emotion-prosody and then grounding those factors through retrieval and feedback agents. The architecture includes an adversarial disentanglement step with regularization, a Dual-Stream Anchoring Controller that pulls from a large acoustic prototype library, and a Fast-Slow Feedback Agent for refinement. That combination is the clearest new element relative to standard controllable TTS work. It does a reasonable job of framing the structural mismatch between text intents and acoustic output as a decoupling problem and then proposing concrete agents to address it, which could appeal to people building practical control systems. The prototype library and gated attention fusion look like sensible engineering choices for making abstract intents more concrete.

The soft spots sit mainly in the experimental grounding. The abstract states consistent improvements on a composite benchmark and public sets, yet supplies no numbers, baselines, or ablations, so it is hard to tell whether the gains come from the disentanglement or from other unmentioned factors. The stress-test concern about residual leakage is reasonable to raise: the regularization is described, but without reported correlation metrics or subspace separation checks, it is unclear whether the identity and emotion-prosody spaces are clean enough for independent control in mixed-instruction cases. The assumption that the prototype library covers arbitrary intents also goes untested in the visible summary.

This paper is aimed at TTS researchers who already work on fine-grained control or multi-agent generation methods. A reader who wants to see how agent loops can be applied to speech synthesis might extract useful design ideas even if the results need more scrutiny. It is worth sending to peer review so that the implementation details and quantitative checks can be examined properly.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces AgentSteerTTS, a multi-agent closed-loop framework for composite-instruction text-to-speech synthesis. It first employs an adversarial disentanglement agent to learn separable identity and emotion-prosody subspaces via leakage-suppressing regularization. A Dual-Stream Anchoring Controller then grounds abstract intents by retrieving expressive anchors from a large-scale acoustic prototype library and fusing them into control vectors with gated attention. A Fast-Slow Feedback Agent refines intensity via latent gradient correction and resolves mismatches with perceptual critique. Experiments on a composite-instruction benchmark and public test sets are reported to yield consistent and significant improvements over baselines.

Significance. If the disentanglement produces cleanly separable subspaces and the prototype library provides sufficient coverage, the framework could meaningfully advance fine-grained, intent-faithful control in expressive TTS, addressing the discrete-to-continuous mismatch that limits current models. The closed-loop multi-agent structure and explicit anchoring mechanism represent a distinct architectural choice that, if validated, may generalize to other conditional generation tasks.

major comments (2)

[Abstract and §3.1 (adversarial disentanglement agent)] The central claim of consistent improvements rests on the adversarial disentanglement agent producing cleanly separable subspaces. The abstract and method description mention leakage-suppressing regularization, yet no quantitative verification (e.g., correlation coefficients, mutual information, or orthogonality metrics between identity and emotion-prosody embeddings) is provided to confirm residual leakage is negligible in the regimes required for composite instructions. If leakage persists, the Dual-Stream Anchoring Controller cannot ground intents independently, undermining the reported gains.
[§4 (Experiments)] The experimental results assert 'consistent and significant improvements' on a composite-instruction benchmark and public test sets, but the manuscript provides no tabulated metrics, baseline comparisons, error bars, or statistical tests in the visible description. This absence makes it impossible to evaluate effect sizes or rule out post-hoc selection effects.

minor comments (2)

[§3.2] Clarify the precise definition and training objective of the gated attention fusion in the Synthesis Agent, including any hyper-parameters that control the balance between retrieved anchors.
[§3.2] Provide details on the scale, construction, and coverage statistics of the large-scale acoustic prototype library to substantiate the claim that it supports arbitrary composite intents.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We address each major comment below, indicating the specific revisions we will make to strengthen the manuscript.


Referee: [Abstract and §3.1 (adversarial disentanglement agent)] The central claim of consistent improvements rests on the adversarial disentanglement agent producing cleanly separable subspaces. The abstract and method description mention leakage-suppressing regularization, yet no quantitative verification (e.g., correlation coefficients, mutual information, or orthogonality metrics between identity and emotion-prosody embeddings) is provided to confirm residual leakage is negligible in the regimes required for composite instructions. If leakage persists, the Dual-Stream Anchoring Controller cannot ground intents independently, undermining the reported gains.

Authors: We agree that quantitative verification of subspace separability is essential to support the claims. In the revised manuscript we will add explicit metrics, including Pearson correlation coefficients, mutual information estimates, and orthogonality measures (e.g., cosine similarity or Gram-matrix off-diagonal norms) computed between the identity and emotion-prosody embeddings on both training and held-out composite-instruction data. These additions will directly demonstrate that residual leakage is negligible under the conditions used for the reported experiments. revision: yes
Referee: [§4 (Experiments)] The experimental results assert 'consistent and significant improvements' on a composite-instruction benchmark and public test sets, but the manuscript provides no tabulated metrics, baseline comparisons, error bars, or statistical tests in the visible description. This absence makes it impossible to evaluate effect sizes or rule out post-hoc selection effects.

Authors: We acknowledge that the current presentation lacks sufficient tabular detail for independent evaluation. In the revised version we will include comprehensive result tables that report all objective and subjective metrics for AgentSteerTTS and every baseline, together with standard deviations or confidence intervals, and p-values from appropriate statistical tests (e.g., paired t-tests or Wilcoxon signed-rank tests with multiple-comparison correction). This will allow readers to assess effect sizes and rule out selection bias. revision: yes
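The kind of paired comparison promised in the second response can be illustrated without external libraries. The rebuttal names paired t-tests and Wilcoxon signed-rank tests; the sketch below swaps in a different, dependency-free procedure, an exact paired sign-flip permutation test, which asks how often randomly flipping the sign of each per-item difference produces a total at least as extreme as the observed one. The function name and the input layout (one score per test item for each system) are assumptions for illustration.

```python
from itertools import product

def paired_permutation_pvalue(system, baseline):
    """Two-sided exact sign-flip permutation test on paired scores.

    Enumerates all 2^n sign assignments, so it is only practical for
    small n (roughly n <= 20); larger sets would need random sampling.
    """
    diffs = [s - b for s, b in zip(system, baseline)]
    observed = abs(sum(diffs))
    extreme = 0
    total = 0
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        stat = abs(sum(sg * d for sg, d in zip(signs, diffs)))
        # Small tolerance guards against floating-point ties.
        if stat >= observed - 1e-12:
            extreme += 1
    return extreme / total
```

When several metrics or baselines are compared at once, the resulting p-values would still need the multiple-comparison correction the authors mention.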

Circularity Check

0 steps flagged

No circularity: empirical validation of multi-agent TTS framework


The paper describes an architectural framework consisting of an adversarial disentanglement agent, Dual-Stream Anchoring Controller with retrieval and synthesis agents, and Fast-Slow Feedback Agent. Effectiveness is asserted solely through experimental improvements on a composite-instruction benchmark and public test sets, with no mathematical derivation chain, fitted parameters renamed as predictions, or load-bearing self-citations that reduce the central claim to its own inputs by construction. The description remains self-contained against external benchmarks and does not invoke uniqueness theorems or ansatzes from prior self-work in a circular manner.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no explicit free parameters, axioms, or invented entities; the framework implicitly assumes separability of acoustic subspaces and coverage of a prototype library, but these are not formalized.

pith-pipeline@v0.9.0 · 5742 in / 1214 out tokens · 30635 ms · 2026-05-20T21:11:14.973844+00:00 · methodology


Reference graph

Works this paper leans on

110 extracted references · 110 canonical work pages · 10 internal anchors

[1]

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , month =

Chen, Zhe and Wu, Jiannan and Wang, Wenhai and Su, Weijie and Chen, Guo and Xing, Sen and Zhong, Muyan and and others , title =. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , month =. 2024 , pages =

work page 2024
[2]

Mini-Gemini: Mining the Potential of Multi-Modality Vision Language Models , year=

Li, Yanwei and Zhang, Yuechen and Wang, Chengyao and Zhong, Zhisheng and Chen, Yixin and Chu, Ruihang and Liu, Shaoteng and Jia, Jiaya , journal=. Mini-Gemini: Mining the Potential of Multi-Modality Vision Language Models , year=

work page
[3]

2023 , volume =

Li, Junnan and Li, Dongxu and Savarese, Silvio and Hoi, Steven , booktitle =. 2023 , volume =

work page 2023
[4]

2025 , booktitle =

Kang, Bin and Chen, Bin and Wang, Junjie and Li, Yulin and Zhao, Junzhi and Wang, Junle and Tian, Zhuotao , title =. 2025 , booktitle =

work page 2025
[5]

Findings of the Association for Computational Linguistics: EMNLP 2023 , month =

Zhang, Dong and Li, Shimin and Zhang, Xin and Zhan, Jun and Wang, Pengyu and Zhou, Yaqian and Qiu, Xipeng , title =. Findings of the Association for Computational Linguistics: EMNLP 2023 , month =. 2023 , pages =

work page 2023
[8]

and King, Irwin and others , title =

Cui, Wenqian and Yu, Dianzhi and Jiao, Xiaoqi and Meng, Ziqiao and Zhang, Guangyan and Wang, Qichao and Guo, Steven Y. and King, Irwin and others , title =. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , month =. 2025 , pages =

work page 2025
[9]

Proceedings of the International Conference on Machine Learning , month =

Ju, Zeqian and Wang, Yuancheng and Shen, Kai and He, Lei and Tan, Xu and Liu, Eric and Leng, Yichong and Zhao, Sheng and Qin, Tao and Bian, Jiang , title =. Proceedings of the International Conference on Machine Learning , month =. 2024 , pages =

work page 2024
[10]

, journal=

Gao, Xiaoxue and Chen, Yiming and Yue, Xianghu and Tsao, Yu and Chen, Nancy F. , journal=. TTSlow: Slow Down Text-to-Speech With Efficiency Robustness Evaluations , year=

work page
[11]

2025 , booktitle =

Yang, Guanrou and Yang, Chen and Chen, Qian and Ma, Ziyang and Chen, Wenxi and Wang, Wen and Wang, Tianrui and others , title =. 2025 , booktitle =

work page 2025
[12]

Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers , year=

Chen, Sanyuan and Wang, Chengyi and Wu, Yu and Zhang, Ziqiang and Zhou, Long and Liu, Shujie and Chen, Zhuo and Liu, Yanqing and others , journal=. Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers , year=

work page
[13]

2025 , eprint=

InstructTTSEval: Benchmarking Complex Natural-Language Instruction Following in Text-to-Speech Systems , author=. 2025 , eprint=

work page 2025
[14]

2025 , eprint=

IndexTTS: An Industrial-Level Controllable and Efficient Zero-Shot Text-To-Speech System , author=. 2025 , eprint=

work page 2025
[15]

Proceedings of the 38th International Conference on Machine Learning , pages =

Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech , author =. Proceedings of the 38th International Conference on Machine Learning , pages =. 2021 , volume =

work page 2021
[16]

Voicebox: Text-Guided Multilingual Universal Speech Generation at Scale , volume =

Le, Matthew and Vyas, Apoorv and Shi, Bowen and Karrer, Brian and Sari, Leda and Moritz, Rashel and others , booktitle =. Voicebox: Text-Guided Multilingual Universal Speech Generation at Scale , volume =

work page
[17]

StyleTTS: A Style-Based Generative Model for Natural and Diverse Text-to-Speech Synthesis , year=

Li, Yinghao Aaron and Han, Cong and Mesgarani, Nima , journal=. StyleTTS: A Style-Based Generative Model for Natural and Diverse Text-to-Speech Synthesis , year=

work page
[18]

IEEE Transactions on Neural Networks and Learning Systems , year =

Diffsody: Disentangling Speaker-Invariant Prosody Representations via Diffusion Probabilistic Models , author =. IEEE Transactions on Neural Networks and Learning Systems , year =

work page
[19]

Proceedings of Interspeech , year =

DiEmo-TTS: Disentangled Emotion Representations via Self-Supervised Distillation for Cross-Speaker Emotion Transfer in Text-to-Speech , author =. Proceedings of Interspeech , year =

work page
[20]

Self-Refine: Iterative Refinement with Self-Feedback , volume =

Madaan, Aman and Tandon, Niket and Gupta, Prakhar and Hallinan, Skyler and Gao, Luyu and Wiegreffe, Sarah and Alon, Uri and others , booktitle =. Self-Refine: Iterative Refinement with Self-Feedback , volume =

work page
[21]

Seen and Unseen Emotional Style Transfer for Voice Conversion with A New Emotional Speech Dataset , year=

Zhou, Kun and Sisman, Berrak and Liu, Rui and Li, Haizhou , booktitle=. Seen and Unseen Emotional Style Transfer for Voice Conversion with A New Emotional Speech Dataset , year=

work page
[22]

2025 , eprint=

The MSP-Podcast Corpus , author=. 2025 , eprint=

work page 2025
[23]

Findings of the Association for Computational Linguistics: ACL 2024 , year =

emotion2vec: Self-Supervised Pre-Training for Speech Emotion Representation , author =. Findings of the Association for Computational Linguistics: ACL 2024 , year =

work page 2024
[24]

Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics , year =

F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching , author =. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics , year =

work page
[25]

2024 , eprint =

CosyVoice: A Scalable Multilingual Zero-shot Text-to-speech Synthesizer based on Supervised Semantic Tokens , author =. 2024 , eprint =

work page 2024
[26]

2024 , eprint =

CosyVoice 2: Scalable Streaming Speech Synthesis with Large Language Models , author =. 2024 , eprint =

work page 2024
[27]

2025 , eprint=

Spark-TTS: An Efficient LLM-Based Text-to-Speech Model with Single-Stream Decoupled Speech Tokens , author=. 2025 , eprint=

work page 2025
[28]

2025 , eprint =

IndexTTS2: A Breakthrough in Emotionally Expressive and Duration-Controlled Auto-Regressive Zero-Shot Text-to-Speech , author =. 2025 , eprint =

work page 2025
[29]

International Conference on Learning Representations (ICLR) , year =

ReAct: Synergizing Reasoning and Acting in Language Models , author =. International Conference on Learning Representations (ICLR) , year =

work page
[30]

, author =

Multi-agent reinforcement learning for resources allocation optimization: a survey. , author =. Artificial Intelligence Review , year =

work page
[31]

Proceedings of the Thirteenth International Conference on Learning Representations , year =

DoF: A Diffusion Factorization Framework for Offline Multi-Agent Reinforcement Learning , author =. Proceedings of the Thirteenth International Conference on Learning Representations , year =

work page
[32]

and Cai, Tianhui and Huang, Zhiyu and Zhou, Bolei and Ma, Jiaqi , title =

Zhou, Zewei and Zhao, Seth Z. and Cai, Tianhui and Huang, Zhiyu and Zhou, Bolei and Ma, Jiaqi , title =. Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) , month =. 2025 , pages =

work page 2025
[33]

2025 , booktitle =

Yang, Guanrou and Yang, Chen and Chen, Qian and Ma, Ziyang and Chen, Wenxi and Wang, Wen and Wang, Tianrui and Yang, Yifan and others , title =. 2025 , booktitle =

work page 2025
[34]

2025 , eprint=

AlphaAgents: Large Language Model based Multi-Agents for Equity Portfolio Constructions , author=. 2025 , eprint=

work page 2025
[35]

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , month =

EMOVA: Empowering Language Models to See, Hear and Speak with Vivid Emotions , author =. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , month =. 2025 , pages =

work page 2025
[37]

Findings of the Association for Computational Linguistics: ACL 2025 , year =

LLMVoX: Autoregressive Streaming Text-to-Speech Model for Any LLM , author =. Findings of the Association for Computational Linguistics: ACL 2025 , year =

work page 2025
[38]

, booktitle=

Gao, Xiaoxue and Zhang, Chen and Chen, Yiming and Zhang, Huayun and Chen, Nancy F. , booktitle=. Emo-DPO: Controllable Emotional Speech Synthesis through Direct Preference Optimization , year=

work page
[39]

2023 , eprint =

Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers , author =. 2023 , eprint =

work page 2023
[40]

2024 , eprint =

MM-TTS: A Unified Framework for Multi-Modal Prompt-Based Emotional Text-to-Speech , author =. 2024 , eprint =

work page 2024
[41]

2024 , eprint =

Qwen2-Audio Technical Report , author =. 2024 , eprint =

work page 2024
[42]

2024 , eprint =

Gemini 1.5: Unlocking Multimodal Understanding Across Millions of Tokens of Context , author =. 2024 , eprint =

work page 2024
[43]

2024 , howpublished =

GPT-4o System Card , author =. 2024 , howpublished =

work page 2024
[44]

2025 , eprint =

Qwen3 Technical Report , author =. 2025 , eprint =

work page 2025
[45]

2023 , eprint =

Guiding FastSpeech2 Towards Emotional Text-to-Speech , author =. 2023 , eprint =

work page 2023
[46]

Applied Sciences , year =

An Emotion Speech Synthesis Method Based on VITS , author =. Applied Sciences , year =

work page
[47]

Advances in Neural Information Processing Systems (NeurIPS) , year =

GenerSpeech: Towards Style Transfer for Generalizable Out-of-Domain Text-to-Speech , author =. Advances in Neural Information Processing Systems (NeurIPS) , year =

work page
[48]

Cognition & Emotion , volume =

An Argument for Basic Emotions , author =. Cognition & Emotion , volume =

[50]

CAMEL: Communicative Agents for "Mind" Exploration of Large Language Model Society

CAMEL: Communicative Agents for "Mind" Exploration of Large Language Model Society. Advances in Neural Information Processing Systems (NeurIPS)

[52]

DialogueAgents: A Hybrid Agent-Based Speech Synthesis Framework for Multi-Party Dialogue

DialogueAgents: A Hybrid Agent-Based Speech Synthesis Framework for Multi-Party Dialogue. IEEE International Conference on Multimedia & Expo (ICME)

[54]

LongHorizon

Kang, Bin, Wen, Shaoguo, Bi, Yifei, Wu, Shunlong, Yuan, Xinbin, Shao, Rui, Wang, Junle, and Tian, Zhuotao. LongHorizon

[55]

Less is More, But Where? Dynamic Token Compression via

Li, Yulin, Gui, Haokun, Fan, Ziyang, Wang, Junjie, Kang, Bin, Chen, Bin, and Tian, Zhuotao. Less is More, But Where? Dynamic Token Compression via

[56]

Efficient Reasoning with Balanced Thinking

Efficient Reasoning with Balanced Thinking. Proceedings of the 14th International Conference on Learning Representations

[57]

Wang, Junjie, Chen, Bin, Li, Yulin, Kang, Bin, Chen, Yichi, and Tian, Zhuotao.

[58]

Wang, Junjie, Chen, Bin, Kang, Bin, Li, Yulin, Xian, Weizhi, Chen, Yichi, and Xu, Yong.

[60]

Mini-Gemini: Mining the Potential of Multi-Modality Vision Language Models

Li, Yanwei, Zhang, Yuechen, Wang, Chengyao, Zhong, Zhisheng, Chen, Yixin, Chu, Ruihang, Liu, Shaoteng, and Jia, Jiaya. Mini-Gemini: Mining the Potential of Multi-Modality Vision Language Models. IEEE Transactions on Pattern Analysis and Machine Intelligence. pages 1-14. 2025

[61]

BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models

Li, Junnan, Li, Dongxu, Savarese, Silvio, and Hoi, Steven. BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models. Proceedings of the 40th International Conference on Machine Learning. 202, pages 19730--19742. 2023

[62]

CalibCLIP: Contextual Calibration of Dominant Semantics for Text-Driven Image Retrieval

Kang, Bin, Chen, Bin, Wang, Junjie, Li, Yulin, Zhao, Junzhi, Wang, Junle, and Tian, Zhuotao. CalibCLIP: Contextual Calibration of Dominant Semantics for Text-Driven Image Retrieval. Proceedings of the 33rd ACM International Conference on Multimedia. pages 5140--5149. 2025

[63]

Less is More, But Where? Dynamic Token Compression via LLM-Guided Keyframe Prior

Li, Yulin, Gui, Haokun, Fan, Ziyang, Wang, Junjie, Kang, Bin, Chen, Bin, and Tian, Zhuotao. Less is More, But Where? Dynamic Token Compression via LLM-Guided Keyframe Prior. Advances in Neural Information Processing Systems. 38, pages 156861--156904. 2026

[64]

Efficient Reasoning with Balanced Thinking

Li, Yulin, Tu, Tengyao, Ding, Li, Wang, Junjie, Zhen, Huiling, Chen, Yixin, Yong, Li, and Tian, Zhuotao. Efficient Reasoning with Balanced Thinking. Proceedings of the 14th International Conference on Learning Representations. 2026

[65]

SpeechGPT: Empowering Large Language Models with Intrinsic Cross-Modal Conversational Abilities

Zhang, Dong, Li, Shimin, Zhang, Xin, Zhan, Jun, Wang, Pengyu, Zhou, Yaqian, and Qiu, Xipeng. SpeechGPT: Empowering Large Language Models with Intrinsic Cross-Modal Conversational Abilities. Findings of the Association for Computational Linguistics: EMNLP 2023. pages 15757-15773. 2023

[66]

AudioPaLM: A Large Language Model That Can Speak and Listen

Rubenstein, Paul K., Asawaroengchai, Chulayuth, Nguyen, Duc Dung, Bapna, Ankur, Borsos, Zalán, and others. AudioPaLM: A Large Language Model That Can Speak and Listen. Manuscript. 2023. arXiv:2306.12925

[68]

Recent Advances in Speech Language Models: A Survey

Cui, Wenqian, Yu, Dianzhi, Jiao, Xiaoqi, Meng, Ziqiao, Zhang, Guangyan, Wang, Qichao, Guo, Steven Y., King, Irwin, and others. Recent Advances in Speech Language Models: A Survey. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). pages 13943--13970. 2025

[69]

NaturalSpeech 3: Zero-Shot Speech Synthesis with Factorized Codec and Diffusion Models

Ju, Zeqian, Wang, Yuancheng, Shen, Kai, He, Lei, Tan, Xu, Liu, Eric, Leng, Yichong, Zhao, Sheng, Qin, Tao, and Bian, Jiang. NaturalSpeech 3: Zero-Shot Speech Synthesis with Factorized Codec and Diffusion Models. Proceedings of the International Conference on Machine Learning. pages 59545--59570. 2024

[70]

TTSlow: Slow Down Text-to-Speech With Efficiency Robustness Evaluations

Gao, Xiaoxue, Chen, Yiming, Yue, Xianghu, Tsao, Yu, and Chen, Nancy F. TTSlow: Slow Down Text-to-Speech With Efficiency Robustness Evaluations. IEEE Transactions on Audio, Speech and Language Processing. 33, pages 693-704. 2025

[71]

Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers

Chen, Sanyuan, Wang, Chengyi, Wu, Yu, Zhang, Ziqiang, Zhou, Long, Liu, Shujie, Chen, Zhuo, Liu, Yanqing, and others. Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers. IEEE/ACM Transactions on Audio, Speech, and Language Processing. 33, pages 705-718. 2025

[72]

InstructTTSEval: Benchmarking Complex Natural-Language Instruction Following in Text-to-Speech Systems

Kexin Huang, Qian Tu, Liwei Fan, Chenchen Yang, Dong Zhang, Shimin Li, Zhaoye Fei, Qinyuan Cheng, and Xipeng Qiu. InstructTTSEval: Benchmarking Complex Natural-Language Instruction Following in Text-to-Speech Systems. Manuscript. 2025. arXiv:2506.16381

[73]

EmoVoice: LLM-based Emotional Text-To-Speech Model with Freestyle Text Prompting

Yang, Guanrou, Yang, Chen, Chen, Qian, Ma, Ziyang, Chen, Wenxi, Wang, Wen, Wang, Tianrui, and others. EmoVoice: LLM-based Emotional Text-To-Speech Model with Freestyle Text Prompting. Proceedings of the 33rd ACM International Conference on Multimedia. pages 10748--10757. 2025

[74]

IndexTTS2: A Breakthrough in Emotionally Expressive and Duration-Controlled Auto-Regressive Zero-Shot Text-to-Speech

Zhou, Siyi, Zhou, Yiquan, He, Yi, Zhou, Xun, Wang, Jinchao, Deng, Wei, and Shu, Jingchen. IndexTTS2: A Breakthrough in Emotionally Expressive and Duration-Controlled Auto-Regressive Zero-Shot Text-to-Speech. Manuscript. 2025. arXiv:2506.21619

[75]

Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech

Kim, Jaehyeon, Kong, Jungil, and Son, Juhee. Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech. Proceedings of the 38th International Conference on Machine Learning. 139, pages 5530--5540. 2021

[76]

Seen and Unseen Emotional Style Transfer for Voice Conversion with A New Emotional Speech Dataset

Zhou, Kun, Sisman, Berrak, Liu, Rui, and Li, Haizhou. Seen and Unseen Emotional Style Transfer for Voice Conversion with A New Emotional Speech Dataset. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). pages 920-924. 2021

[77]

Voicebox: Text-Guided Multilingual Universal Speech Generation at Scale

Le, Matthew, Vyas, Apoorv, Shi, Bowen, Karrer, Brian, Sari, Leda, Moritz, Rashel, and others. Voicebox: Text-Guided Multilingual Universal Speech Generation at Scale. Advances in Neural Information Processing Systems. 36, pages 14005--14034. 2023

[78]

Diffsody: Disentangling Speaker-Invariant Prosody Representations via Diffusion Probabilistic Models

Qu, Leyuan, Zhang, Zhuo, Huang, Xiaoliang, Wen, Ping, Zhang, Rui, Wang, Wen, and others. Diffsody: Disentangling Speaker-Invariant Prosody Representations via Diffusion Probabilistic Models. IEEE Transactions on Neural Networks and Learning Systems. pages 1--12. 2025

[79]

DiEmo-TTS: Disentangled Emotion Representations via Self-Supervised Distillation for Cross-Speaker Emotion Transfer in Text-to-Speech

Cho, Deok-Hyeon, Oh, Hyung-Seok, Kim, Seung-Bin, and Lee, Seong-Whan. DiEmo-TTS: Disentangled Emotion Representations via Self-Supervised Distillation for Cross-Speaker Emotion Transfer in Text-to-Speech. Proceedings of Interspeech. pages 4373--4377. 2025

[80]

An Argument for Basic Emotions

Ekman, Paul. An Argument for Basic Emotions. Cognition & Emotion. 6(3--4), pages 169--200. 1992

[81]

An Emotion Speech Synthesis Method Based on VITS

Zhao, Wei and Yang, Zheng. An Emotion Speech Synthesis Method Based on VITS. Applied Sciences. 13(4), pages 2225. 2023

[82]

Guiding FastSpeech2 Towards Emotional Text-to-Speech

Ju, Zeqian and others. Guiding FastSpeech2 Towards Emotional Text-to-Speech. Manuscript. 2023. arXiv:2307.00024

[83]

GenerSpeech: Towards Style Transfer for Generalizable Out-of-Domain Text-to-Speech

Zhao, Yue and others. GenerSpeech: Towards Style Transfer for Generalizable Out-of-Domain Text-to-Speech. Advances in Neural Information Processing Systems (NeurIPS). 2022

[84]

Self-Refine: Iterative Refinement with Self-Feedback

Madaan, Aman, Tandon, Niket, Gupta, Prakhar, Hallinan, Skyler, Gao, Luyu, Wiegreffe, Sarah, Alon, Uri, and others. Self-Refine: Iterative Refinement with Self-Feedback. Advances in Neural Information Processing Systems. 36, pages 46534--46594. 2023

[85]

The MSP-Podcast Corpus

Carlos Busso, Reza Lotfian, Kusha Sridhar, Ali N. Salman, Wei-Cheng Lin, and others. The MSP-Podcast Corpus. Manuscript. 2025. arXiv:2509.09791

[86]

emotion2vec: Self-Supervised Pre-Training for Speech Emotion Representation

Ma, Ziyang, Zheng, Zhisheng, Ye, Jiaxin, Li, Jinchao, Gao, Zhifu, Zhang, ShiLiang, and others. emotion2vec: Self-Supervised Pre-Training for Speech Emotion Representation. Findings of the Association for Computational Linguistics: ACL 2024. pages 15747--15760. 2024

[87]

F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching

Chen, Yushen, Niu, Zhikang, Ma, Ziyang, Deng, Keqi, Wang, Chunhui, Zhao, Jian, Yu, Kai, and Chen, Xie. F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics. pages 6255--6271. 2025

[88]

CosyVoice: A Scalable Multilingual Zero-shot Text-to-speech Synthesizer based on Supervised Semantic Tokens

Du, Zhihao, Chen, Qian, Zhang, Shiliang, Hu, Kai, Lu, Heng, Yang, Yexin, Hu, Hangrui, and others. CosyVoice: A Scalable Multilingual Zero-shot Text-to-speech Synthesizer based on Supervised Semantic Tokens. Manuscript. 2024. arXiv:2407.05407

