FineCombo-TTS: Collaborative and Precise Controllable Speech Synthesis Using Text Descriptions and Reference Speech

Peiji Yang; Shuoyi Zhou; Yicheng Zhong; Yifan Hu; Yixuan Zhou; Zhisheng Wang; Zhiyong Wu

arxiv: 2606.19209 · v2 · pith:AMGX6GZDnew · submitted 2026-06-17 · 💻 cs.SD

FineCombo-TTS: Collaborative and Precise Controllable Speech Synthesis Using Text Descriptions and Reference Speech

Shuoyi Zhou , Yixuan Zhou , Peiji Yang , Yifan Hu , Yicheng Zhong , Zhisheng Wang , Zhiyong Wu This is my paper

Pith reviewed 2026-06-26 19:07 UTC · model grok-4.3

classification 💻 cs.SD

keywords controllable TTSreference speechtext descriptionsconditional flow matchingunified acoustic representationFineEdit datasetspeech variance predictorattribute control

0 comments

The pith

A unified acoustic representation plus flow matching lets text descriptions precisely adjust reference speech in TTS.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents FineCombo-TTS as a single framework that grounds synthesis in a reference voice while using text descriptions to guide specific changes. It avoids splitting sound into separate attributes and instead trains one shared acoustic space where a Conditional Flow Matching predictor learns the exact differences between source and target versions. A new paired dataset called FineEdit supplies explicit source-to-target examples to train relative control. If the method holds, users gain both the natural grounding of an example clip and the flexibility of language instructions in one system.

Core claim

FineCombo-TTS learns a unified acoustic representation and introduces a Conditional Flow Matching-based Speech Variance Predictor that models fine-grained reference-to-target transformations guided by text descriptions, without explicit attribute disentanglement, and is trained on the FineEdit structured paired dataset to support relative attribute control.

What carries the argument

The Conditional Flow Matching-based Speech Variance Predictor that operates inside a unified acoustic representation to predict text-guided changes from reference speech.

If this is right

Reference speech supplies concrete acoustic grounding while text supplies relative adjustments such as higher pitch or slower rate.
The FineEdit dataset enables training on explicit source-to-target pairs for relative rather than absolute control.
The same model produces expressive output by combining both input types instead of treating them separately.
Control remains precise even when multiple acoustic attributes must change together.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same unified-plus-predictor pattern could be tested on music or environmental sound generation where reference clips and text instructions coexist.
If the approach works, it reduces the engineering cost of building separate disentanglement modules for every new audio domain.
A direct test would measure whether the learned transformations generalize to attribute combinations never seen in the FineEdit pairs.

Load-bearing premise

A single shared acoustic representation is rich enough for the flow matching predictor to capture every needed attribute shift when text supplies the target direction.

What would settle it

An experiment in which listeners or objective metrics show no gain in precise attribute control when text descriptions are added to the reference input, or when the variance predictor is replaced by a simpler mapping.

Figures

Figures reproduced from arXiv: 2606.19209 by Peiji Yang, Shuoyi Zhou, Yicheng Zhong, Yifan Hu, Yixuan Zhou, Zhisheng Wang, Zhiyong Wu.

**Figure 1.** Figure 1: The proposed FineCombo-TTS enables precise control over the timbre, prosody, and emotion of the source (reference) speech. and practical flexibility. However, achieving such joint control is limited by both data and modeling challenges. Most description-based datasets consist of isolated speech–text pairs describing absolute acoustic properties. Consequently, models learn attribute–acoustic corresponden… view at source ↗

**Figure 2.** Figure 2: The overall architecture of FineCombo-TTS. grounded and preference-driven controllable speech synthesis. In particular, our model supports joint control with both reference speech and text descriptions, as well as single-condition scenarios, enabling zero-shot voice cloning and descriptiondriven speech generation. The main contributions are summarized as follows: • We propose FineCombo-TTS, the first co… view at source ↗

read the original abstract

Controllable text-to-speech (TTS) has become a key research focus. However, methods based on either reference speech or text descriptions lack flexibility and precise control, and recent joint approaches remain loosely coupled, with speech modeling timbre and text controlling global style. We propose FineCombo-TTS, a unified framework for speech synthesis grounded in reference speech and guided by text descriptions, enabling flexible and precise control over acoustic attributes. Instead of explicit attribute disentanglement, we learn a unified acoustic representation and introduce a Conditional Flow Matching (CFM)-based Speech Variance Predictor to model fine-grained reference-to-target transformations guided by text descriptions. To support relative attribute control, we construct FineEdit, a structured paired dataset that explicitly encodes source-to-target attribute variations. Experiments demonstrate that our approach achieves flexible, precise, and expressive controllable TTS.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

FineCombo-TTS combines reference speech and text descriptions via a unified acoustic representation and CFM predictor plus a new FineEdit dataset, but the abstract supplies no metrics or comparisons to support the performance claims.

read the letter

The paper's core move is to drop explicit attribute disentanglement and instead learn one acoustic representation, then use a Conditional Flow Matching predictor to turn reference speech into a target version steered by text descriptions. They also build FineEdit, a paired dataset that records source-to-target attribute shifts so the model can train on relative changes.

That dataset and the CFM choice are the concrete additions. The dataset construction looks practical for anyone who needs paired examples of controlled edits, and routing the transformation through flow matching is a reasonable way to avoid forcing separate timbre and style streams.

The main gap is evidence. The abstract states that experiments show flexible, precise, and expressive control, yet it gives no numbers, no baseline comparisons, no ablation on the CFM component, and no detail on how precision was measured. Without those, the claim that the unified representation plus flow matching actually delivers fine-grained reference-to-target shifts stays untested in the text we have.

The assumption that text can guide those shifts without any form of separation is plausible on paper but needs the full results to check. Minor issues like missing implementation specifics can be fixed; the missing quantitative support is the part that matters most.

This is for people already working on controllable TTS who want a new paired dataset or an example of flow matching in audio. A reader focused on practical voice systems could extract the dataset idea even if the model details need more scrutiny.

Send it to peer review. The framework and dataset are specific enough to warrant referee time, though the authors should expect questions on the experimental section.

Referee Report

1 major / 0 minor

Summary. The paper introduces FineCombo-TTS, a unified framework for controllable TTS that combines reference speech and text descriptions. It avoids explicit attribute disentanglement by learning a unified acoustic representation and employing a Conditional Flow Matching (CFM)-based Speech Variance Predictor to model fine-grained reference-to-target transformations. A new structured paired dataset FineEdit is constructed to support relative attribute control, and the abstract claims that experiments demonstrate flexible, precise, and expressive controllable TTS.

Significance. If the central claims hold with supporting metrics, baselines, and ablations, the work could advance controllable TTS by offering an integrated alternative to loosely coupled reference-plus-text methods. The CFM predictor and FineEdit dataset construction would be the primary contributions, provided they deliver measurable gains in precision without requiring attribute separation.

major comments (1)

[Abstract] Abstract: the claim that 'experiments demonstrate that our approach achieves flexible, precise, and expressive controllable TTS' is unsupported by any reported metrics, baselines, error bars, dataset statistics, or method details. This absence prevents evaluation of whether the unified acoustic representation plus CFM predictor actually delivers fine-grained control.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the detailed review and the opportunity to clarify the support for our claims. We address the single major comment below.

read point-by-point responses

Referee: [Abstract] Abstract: the claim that 'experiments demonstrate that our approach achieves flexible, precise, and expressive controllable TTS' is unsupported by any reported metrics, baselines, error bars, dataset statistics, or method details. This absence prevents evaluation of whether the unified acoustic representation plus CFM predictor actually delivers fine-grained control.

Authors: The abstract is intentionally concise and summarizes findings from the full manuscript. Section 3 details the unified acoustic representation and CFM-based Speech Variance Predictor. Section 4.1 describes the FineEdit dataset construction with explicit source-to-target attribute statistics and pairing methodology. Section 5 reports objective metrics (e.g., MCD, F0 RMSE), subjective evaluations (MOS, preference tests), comparisons to baselines including YourTTS and StyleTTS, ablation studies isolating the CFM predictor, and error bars from repeated runs with statistical significance. These results directly support the abstract claim regarding flexible and precise control. We will revise the abstract to include one sentence referencing key quantitative gains for improved self-containment. revision: yes

Circularity Check

0 steps flagged

No significant circularity

full rationale

The provided abstract and description present FineCombo-TTS as a standard architectural extension: a unified acoustic representation combined with a CFM-based predictor, plus construction of a new paired dataset (FineEdit) for relative control. No equations, loss terms, or derivation steps are supplied that reduce a claimed prediction to a fitted input by construction. No self-citations are invoked as load-bearing uniqueness theorems or ansatzes. The central claim rests on experimental outcomes rather than definitional equivalence or self-referential fitting, making the derivation self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only abstract available; no details on free parameters, axioms, or invented entities provided.

pith-pipeline@v0.9.1-grok · 5692 in / 882 out tokens · 27295 ms · 2026-06-26T19:07:34.511197+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

37 extracted references · 6 linked inside Pith

[1]

Existing controllable methods can be di- vided into three categories: 1) reference speech-based, 2) text description-based, and 3) joint speech-description control

Introduction In recent years, text-to-speech (TTS) technology has achieved near human-level naturalness and intelligibility [1, 2, 3], shift- ing research toward higher-level goals such as expressive and controllable synthesis. Existing controllable methods can be di- vided into three categories: 1) reference speech-based, 2) text description-based, and 3...

Pith/arXiv arXiv 2026
[2]

The model comprises three modules: a Speech Attributes Ex- tractor, a Speech Variance Predictor, and a TTS backbone

Methodology The architecture of our proposed model is illustrated in Figue 2. The model comprises three modules: a Speech Attributes Ex- tractor, a Speech Variance Predictor, and a TTS backbone. The Speech Attributes Extractor encodes prompt speech into a uni- fied embeddingE a via a timbre extractor and residual style en- coder, capturing key acoustic ch...
[3]

Change the style, speak with an angry emotion

Dataset: FineEdit Existing speech–description datasets provide only isolated speech–text pairs describing absolute acoustic properties, which are insufficient for learning reference-conditioned attribute transformations. To address this limitation, we constructFi- neEdit, a structured paired English speech dataset designed to model relative attribute vari...
[4]

Experiment 4.1. Experiment Setup In training stage 1, we pre-train on Multilingual LibriSpeech (MLS, 45k hours) [28] and LibriTTS-R (585 hours) [29], and then fine-tune on EmoV oice-DB (45 hours) [30] and Textrol- Speech (330 hours) [27] for emotion modeling. In stage 2, we train the Speech Variance Predictor using 236K descrip- tion–speech pairs from Tex...

arXiv
[5]

Conclusion In this paper, we propose FineCombo-TTS, a controllable TTS framework that tightly integrates reference speech and text de- scriptions for precise and flexible speech synthesis. Unlike prior pseudo-collaborative approaches, our method unifies both modalities within a shared acoustic attribute space, enabling reference-grounded and text-guided a...
[6]

These tools were not used to generate any core scientific ideas, experimen- tal data, or technical contributions

Generative AI Use Disclosure During the preparation of this manuscript, the authors used gen- erative AI tools exclusively for the purpose of language editing and manuscript polishing to improve readability. These tools were not used to generate any core scientific ideas, experimen- tal data, or technical contributions. All authors have thoroughly reviewe...
[7]

Acknowledgments This work was supported by National Natural Science Founda- tion of China (62076144) and National Social Science Founda- tion of China (13&ZD189)
[8]

Fastspeech 2: Fast and high-quality end-to-end text to speech,

Y . Ren, C. Hu, X. Tan, T. Qin, S. Zhao, Z. Zhao, and T.-Y . Liu, “Fastspeech 2: Fast and high-quality end-to-end text to speech,” arXiv preprint arXiv:2006.04558, 2020

arXiv 2006
[9]

Fastspeech: Fast, robust and controllable text to speech,

Y . Ren, Y . Ruan, X. Tan, T. Qin, S. Zhao, Z. Zhao, and T.-Y . Liu, “Fastspeech: Fast, robust and controllable text to speech,” Advances in neural information processing systems, 2019

2019
[10]

Tacotron: Towards end-to-end speech synthesis,

Y . Wang, R. Skerry-Ryan, D. Stanton, Y . Wu, R. J. Weiss, N. Jaitly, Z. Yang, Y . Xiao, Z. Chen, S. Bengioet al., “Tacotron: Towards end-to-end speech synthesis,”arXiv preprint arXiv:1703.10135, 2017

Pith/arXiv arXiv 2017
[11]

Neural codec language mod- els are zero-shot text to speech synthesizers,

C. Wang, S. Chen, Y . Wu, Z. Zhang, L. Zhou, S. Liu, Z. Chen, Y . Liu, H. Wang, J. Liet al., “Neural codec language mod- els are zero-shot text to speech synthesizers,”arXiv preprint arXiv:2301.02111, 2023

Pith/arXiv arXiv 2023
[12]

Speak, read and prompt: High-fidelity text-to-speech with min- imal supervision,

E. Kharitonov, D. Vincent, Z. Borsos, R. Marinier, S. Girgin, O. Pietquin, M. Sharifi, M. Tagliasacchi, and N. Zeghidour, “Speak, read and prompt: High-fidelity text-to-speech with min- imal supervision,”Transactions of the Association for Computa- tional Linguistics, pp. 1703–1718, 2023

2023
[13]

Improving language model-based zero- shot text-to-speech synthesis with multi-scale acoustic prompts,

S. Lei, Y . Zhou, L. Chen, D. Luo, Z. Wu, X. Wu, S. Kang, T. Jiang, Y . Zhou, Y . Hanet al., “Improving language model-based zero- shot text-to-speech synthesis with multi-scale acoustic prompts,” inICASSP 2024-2024 IEEE International Conference on Acous- tics, Speech and Signal Processing (ICASSP). IEEE, 2024, pp. 12 662–12 666

2024
[14]

Prompttts: Control- lable text-to-speech with text descriptions,

Z. Guo, Y . Leng, Y . Wu, S. Zhao, and X. Tan, “Prompttts: Control- lable text-to-speech with text descriptions,” inICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2023, pp. 1–5

2023
[15]

Instructtts: Modelling expressive tts in discrete latent space with natural lan- guage style prompt,

D. Yang, S. Liu, R. Huang, C. Weng, and H. Meng, “Instructtts: Modelling expressive tts in discrete latent space with natural lan- guage style prompt,”IEEE/ACM Transactions on Audio, Speech, and Language Processing, pp. 2913–2925, 2024

2024
[16]

Promptstyle: Controllable style transfer for text-to-speech with natural language descriptions,

G. Liu, Y . Zhang, Y . Lei, Y . Chen, R. Wang, Z. Li, and L. Xie, “Promptstyle: Controllable style transfer for text-to-speech with natural language descriptions,”arXiv preprint arXiv:2305.19522, 2023

arXiv 2023
[17]

Controlspeech: Towards si- multaneous and independent zero-shot speaker cloning and zero- shot language style control,

S. Ji, Q. Chen, W. Wang, J. Zuo, M. Fang, Z. Jiang, H. Huang, Z. Wang, X. Cheng, S. Zhenget al., “Controlspeech: Towards si- multaneous and independent zero-shot speaker cloning and zero- shot language style control,” inthe 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Pa- pers), 2025, pp. 6966–6981

2025
[18]

V oxinstruct: Expressive human instruction-to-speech generation with unified multilingual codec language modelling,

Y . Zhou, X. Qin, Z. Jin, S. Zhou, S. Lei, S. Zhou, Z. Wu, and J. Jia, “V oxinstruct: Expressive human instruction-to-speech generation with unified multilingual codec language modelling,” inthe 32nd ACM International Conference on Multimedia, 2024, pp. 554– 563

2024
[19]

Flespeech: Flexibly controllable speech generation with various prompts,

H. Li, Y . Li, X. Wang, J. Hu, Q. Xie, S. Yang, and L. Xie, “Flespeech: Flexibly controllable speech generation with various prompts,”arXiv preprint arXiv:2501.04644, 2025

arXiv 2025
[20]

Scal- ing instruction-finetuned language models,

H. W. Chung, L. Hou, S. Longpre, B. Zoph, Y . Tay, W. Fe- dus, Y . Li, X. Wang, M. Dehghani, S. Brahmaet al., “Scal- ing instruction-finetuned language models,”Journal of Machine Learning Research, no. 70, pp. 1–53, 2024

2024
[21]

High-fidelity audio compression with improved rvqgan,

R. Kumar, P. Seetharaman, A. Luebs, I. Kumar, and K. Kumar, “High-fidelity audio compression with improved rvqgan,”Ad- vances in Neural Information Processing Systems, pp. 27 980– 27 993, 2023

2023
[22]

Naturalspeech 3: Zero-shot speech syn- thesis with factorized codec and diffusion models,

Z. Ju, Y . Wang, K. Shen, X. Tan, D. Xin, D. Yang, Y . Liu, Y . Leng, K. Song, S. Tanget al., “Naturalspeech 3: Zero-shot speech syn- thesis with factorized codec and diffusion models,”arXiv preprint arXiv:2403.03100, 2024

arXiv 2024
[23]

Meta-stylespeech: Multi-speaker adaptive text-to-speech generation,

D. Min, D. B. Lee, E. Yang, and S. J. Hwang, “Meta-stylespeech: Multi-speaker adaptive text-to-speech generation,” inInterna- tional Conference on Machine Learning. PMLR, 2021, pp. 7748–7759

2021
[24]

Latent space editing in transformer-based flow matching,

V . T. Hu, W. Zhang, M. Tang, P. Mettes, D. Zhao, and C. Snoek, “Latent space editing in transformer-based flow matching,” in AAAI conference on artificial intelligence, 2024, pp. 2247–2255

2024
[25]

Imagic: Text-based real image edit- ing with diffusion models,

B. Kawar, S. Zada, O. Lang, O. Tov, H. Chang, T. Dekel, I. Mosseri, and M. Irani, “Imagic: Text-based real image edit- ing with diffusion models,” inIEEE/CVF conference on computer vision and pattern recognition, 2023, pp. 6007–6017

2023
[26]

Motion flow matching for human motion synthesis and editing,

V . T. Hu, W. Yin, P. Ma, Y . Chen, B. Fernando, Y . M. Asano, E. Gavves, P. Mettes, B. Ommer, and C. G. Snoek, “Motion flow matching for human motion synthesis and editing,”arXiv preprint arXiv:2312.08895, 2023

arXiv 2023
[27]

Flow matching for generative modeling,

Y . Lipman, R. T. Chen, H. Ben-Hamu, M. Nickel, and M. Le, “Flow matching for generative modeling,”arXiv preprint arXiv:2210.02747, 2022

Pith/arXiv arXiv 2022
[28]

Classifier-free diffusion guidance,

J. Ho and T. Salimans, “Classifier-free diffusion guidance,”arXiv preprint arXiv:2207.12598, 2022

Pith/arXiv arXiv 2022
[29]

Simple and controllable music gener- ation,

J. Copet, F. Kreuk, I. Gat, T. Remez, D. Kant, G. Synnaeve, Y . Adi, and A. D´efossez, “Simple and controllable music gener- ation,”Advances in Neural Information Processing Systems, pp. 47 704–47 720, 2023

2023
[30]

Natural language guidance of high- fidelity text-to-speech with synthetic annotations,

D. Lyth and S. King, “Natural language guidance of high- fidelity text-to-speech with synthetic annotations,”arXiv preprint arXiv:2402.01912, 2024

arXiv 2024
[31]

Libritts-r: A restored multi-speaker text-to-speech corpus,

Y . Koizumi, H. Zen, S. Karita, Y . Ding, K. Yatabe, N. Morioka, M. Bacchiani, Y . Zhang, W. Han, and A. Bapna, “Libritts-r: A restored multi-speaker text-to-speech corpus,”arXiv preprint arXiv:2305.18802, 2023

arXiv 2023
[32]

Emotional voice con- version: Theory, databases and esd,

K. Zhou, B. Sisman, R. Liu, and H. Li, “Emotional voice con- version: Theory, databases and esd,”Speech Communication, pp. 1–18, 2022

2022
[33]

Libritts-p: A corpus with speaking style and speaker identity prompts for text-to-speech and style captioning,

M. Kawamura, R. Yamamoto, Y . Shirahata, T. Hasumi, and K. Tachibana, “Libritts-p: A corpus with speaking style and speaker identity prompts for text-to-speech and style captioning,” arXiv preprint arXiv:2406.07969, 2024

arXiv 2024
[34]

Textrolspeech: A text style control speech corpus with codec language text-to-speech models,

S. Ji, J. Zuo, M. Fang, Z. Jiang, F. Chen, X. Duan, B. Huai, and Z. Zhao, “Textrolspeech: A text style control speech corpus with codec language text-to-speech models,” inICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2024, pp. 10 301–10 305

2024
[35]

Mls: A large-scale multilingual dataset for speech research,

V . Pratap, Q. Xu, A. Sriram, G. Synnaeve, and R. Collobert, “Mls: A large-scale multilingual dataset for speech research,” arXiv preprint arXiv:2012.03411, 2020

Pith/arXiv arXiv 2012
[36]

Librivox: Free public domain audiobooks,

J. Kearns, “Librivox: Free public domain audiobooks,”Reference Reviews, pp. 7–8, 2014

2014
[37]

Emovoice: Llm-based emo- tional text-to-speech model with freestyle text prompting,

G. Yang, C. Yang, Q. Chen, Z. Ma, W. Chen, W. Wang, T. Wang, Y . Yang, Z. Niu, W. Liuet al., “Emovoice: Llm-based emo- tional text-to-speech model with freestyle text prompting,”arXiv preprint arXiv:2504.12867, 2025

arXiv 2025

[1] [1]

Existing controllable methods can be di- vided into three categories: 1) reference speech-based, 2) text description-based, and 3) joint speech-description control

Introduction In recent years, text-to-speech (TTS) technology has achieved near human-level naturalness and intelligibility [1, 2, 3], shift- ing research toward higher-level goals such as expressive and controllable synthesis. Existing controllable methods can be di- vided into three categories: 1) reference speech-based, 2) text description-based, and 3...

Pith/arXiv arXiv 2026

[2] [2]

The model comprises three modules: a Speech Attributes Ex- tractor, a Speech Variance Predictor, and a TTS backbone

Methodology The architecture of our proposed model is illustrated in Figue 2. The model comprises three modules: a Speech Attributes Ex- tractor, a Speech Variance Predictor, and a TTS backbone. The Speech Attributes Extractor encodes prompt speech into a uni- fied embeddingE a via a timbre extractor and residual style en- coder, capturing key acoustic ch...

[3] [3]

Change the style, speak with an angry emotion

Dataset: FineEdit Existing speech–description datasets provide only isolated speech–text pairs describing absolute acoustic properties, which are insufficient for learning reference-conditioned attribute transformations. To address this limitation, we constructFi- neEdit, a structured paired English speech dataset designed to model relative attribute vari...

[4] [4]

Experiment 4.1. Experiment Setup In training stage 1, we pre-train on Multilingual LibriSpeech (MLS, 45k hours) [28] and LibriTTS-R (585 hours) [29], and then fine-tune on EmoV oice-DB (45 hours) [30] and Textrol- Speech (330 hours) [27] for emotion modeling. In stage 2, we train the Speech Variance Predictor using 236K descrip- tion–speech pairs from Tex...

arXiv

[5] [5]

Conclusion In this paper, we propose FineCombo-TTS, a controllable TTS framework that tightly integrates reference speech and text de- scriptions for precise and flexible speech synthesis. Unlike prior pseudo-collaborative approaches, our method unifies both modalities within a shared acoustic attribute space, enabling reference-grounded and text-guided a...

[6] [6]

These tools were not used to generate any core scientific ideas, experimen- tal data, or technical contributions

Generative AI Use Disclosure During the preparation of this manuscript, the authors used gen- erative AI tools exclusively for the purpose of language editing and manuscript polishing to improve readability. These tools were not used to generate any core scientific ideas, experimen- tal data, or technical contributions. All authors have thoroughly reviewe...

[7] [7]

Acknowledgments This work was supported by National Natural Science Founda- tion of China (62076144) and National Social Science Founda- tion of China (13&ZD189)

[8] [8]

Fastspeech 2: Fast and high-quality end-to-end text to speech,

Y . Ren, C. Hu, X. Tan, T. Qin, S. Zhao, Z. Zhao, and T.-Y . Liu, “Fastspeech 2: Fast and high-quality end-to-end text to speech,” arXiv preprint arXiv:2006.04558, 2020

arXiv 2006

[9] [9]

Fastspeech: Fast, robust and controllable text to speech,

Y . Ren, Y . Ruan, X. Tan, T. Qin, S. Zhao, Z. Zhao, and T.-Y . Liu, “Fastspeech: Fast, robust and controllable text to speech,” Advances in neural information processing systems, 2019

2019

[10] [10]

Tacotron: Towards end-to-end speech synthesis,

Y . Wang, R. Skerry-Ryan, D. Stanton, Y . Wu, R. J. Weiss, N. Jaitly, Z. Yang, Y . Xiao, Z. Chen, S. Bengioet al., “Tacotron: Towards end-to-end speech synthesis,”arXiv preprint arXiv:1703.10135, 2017

Pith/arXiv arXiv 2017

[11] [11]

Neural codec language mod- els are zero-shot text to speech synthesizers,

C. Wang, S. Chen, Y . Wu, Z. Zhang, L. Zhou, S. Liu, Z. Chen, Y . Liu, H. Wang, J. Liet al., “Neural codec language mod- els are zero-shot text to speech synthesizers,”arXiv preprint arXiv:2301.02111, 2023

Pith/arXiv arXiv 2023

[12] [12]

Speak, read and prompt: High-fidelity text-to-speech with min- imal supervision,

E. Kharitonov, D. Vincent, Z. Borsos, R. Marinier, S. Girgin, O. Pietquin, M. Sharifi, M. Tagliasacchi, and N. Zeghidour, “Speak, read and prompt: High-fidelity text-to-speech with min- imal supervision,”Transactions of the Association for Computa- tional Linguistics, pp. 1703–1718, 2023

2023

[13] [13]

Improving language model-based zero- shot text-to-speech synthesis with multi-scale acoustic prompts,

S. Lei, Y . Zhou, L. Chen, D. Luo, Z. Wu, X. Wu, S. Kang, T. Jiang, Y . Zhou, Y . Hanet al., “Improving language model-based zero- shot text-to-speech synthesis with multi-scale acoustic prompts,” inICASSP 2024-2024 IEEE International Conference on Acous- tics, Speech and Signal Processing (ICASSP). IEEE, 2024, pp. 12 662–12 666

2024

[14] [14]

Prompttts: Control- lable text-to-speech with text descriptions,

Z. Guo, Y . Leng, Y . Wu, S. Zhao, and X. Tan, “Prompttts: Control- lable text-to-speech with text descriptions,” inICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2023, pp. 1–5

2023

[15] [15]

Instructtts: Modelling expressive tts in discrete latent space with natural lan- guage style prompt,

D. Yang, S. Liu, R. Huang, C. Weng, and H. Meng, “Instructtts: Modelling expressive tts in discrete latent space with natural lan- guage style prompt,”IEEE/ACM Transactions on Audio, Speech, and Language Processing, pp. 2913–2925, 2024

2024

[16] [16]

Promptstyle: Controllable style transfer for text-to-speech with natural language descriptions,

G. Liu, Y . Zhang, Y . Lei, Y . Chen, R. Wang, Z. Li, and L. Xie, “Promptstyle: Controllable style transfer for text-to-speech with natural language descriptions,”arXiv preprint arXiv:2305.19522, 2023

arXiv 2023

[17] [17]

Controlspeech: Towards si- multaneous and independent zero-shot speaker cloning and zero- shot language style control,

S. Ji, Q. Chen, W. Wang, J. Zuo, M. Fang, Z. Jiang, H. Huang, Z. Wang, X. Cheng, S. Zhenget al., “Controlspeech: Towards si- multaneous and independent zero-shot speaker cloning and zero- shot language style control,” inthe 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Pa- pers), 2025, pp. 6966–6981

2025

[18] [18]

V oxinstruct: Expressive human instruction-to-speech generation with unified multilingual codec language modelling,

Y . Zhou, X. Qin, Z. Jin, S. Zhou, S. Lei, S. Zhou, Z. Wu, and J. Jia, “V oxinstruct: Expressive human instruction-to-speech generation with unified multilingual codec language modelling,” inthe 32nd ACM International Conference on Multimedia, 2024, pp. 554– 563

2024

[19] [19]

Flespeech: Flexibly controllable speech generation with various prompts,

H. Li, Y . Li, X. Wang, J. Hu, Q. Xie, S. Yang, and L. Xie, “Flespeech: Flexibly controllable speech generation with various prompts,”arXiv preprint arXiv:2501.04644, 2025

arXiv 2025

[20] [20]

Scal- ing instruction-finetuned language models,

H. W. Chung, L. Hou, S. Longpre, B. Zoph, Y . Tay, W. Fe- dus, Y . Li, X. Wang, M. Dehghani, S. Brahmaet al., “Scal- ing instruction-finetuned language models,”Journal of Machine Learning Research, no. 70, pp. 1–53, 2024

2024

[21] [21]

High-fidelity audio compression with improved rvqgan,

R. Kumar, P. Seetharaman, A. Luebs, I. Kumar, and K. Kumar, “High-fidelity audio compression with improved rvqgan,”Ad- vances in Neural Information Processing Systems, pp. 27 980– 27 993, 2023

2023

[22] [22]

Naturalspeech 3: Zero-shot speech syn- thesis with factorized codec and diffusion models,

Z. Ju, Y . Wang, K. Shen, X. Tan, D. Xin, D. Yang, Y . Liu, Y . Leng, K. Song, S. Tanget al., “Naturalspeech 3: Zero-shot speech syn- thesis with factorized codec and diffusion models,”arXiv preprint arXiv:2403.03100, 2024

arXiv 2024

[23] [23]

Meta-stylespeech: Multi-speaker adaptive text-to-speech generation,

D. Min, D. B. Lee, E. Yang, and S. J. Hwang, “Meta-stylespeech: Multi-speaker adaptive text-to-speech generation,” inInterna- tional Conference on Machine Learning. PMLR, 2021, pp. 7748–7759

2021

[24] [24]

Latent space editing in transformer-based flow matching,

V . T. Hu, W. Zhang, M. Tang, P. Mettes, D. Zhao, and C. Snoek, “Latent space editing in transformer-based flow matching,” in AAAI conference on artificial intelligence, 2024, pp. 2247–2255

2024

[25] [25]

Imagic: Text-based real image edit- ing with diffusion models,

B. Kawar, S. Zada, O. Lang, O. Tov, H. Chang, T. Dekel, I. Mosseri, and M. Irani, “Imagic: Text-based real image edit- ing with diffusion models,” inIEEE/CVF conference on computer vision and pattern recognition, 2023, pp. 6007–6017

2023

[26] [26]

Motion flow matching for human motion synthesis and editing,

V . T. Hu, W. Yin, P. Ma, Y . Chen, B. Fernando, Y . M. Asano, E. Gavves, P. Mettes, B. Ommer, and C. G. Snoek, “Motion flow matching for human motion synthesis and editing,”arXiv preprint arXiv:2312.08895, 2023

arXiv 2023

[27] [27]

Flow matching for generative modeling,

Y . Lipman, R. T. Chen, H. Ben-Hamu, M. Nickel, and M. Le, “Flow matching for generative modeling,”arXiv preprint arXiv:2210.02747, 2022

Pith/arXiv arXiv 2022

[28] [28]

Classifier-free diffusion guidance,

J. Ho and T. Salimans, “Classifier-free diffusion guidance,”arXiv preprint arXiv:2207.12598, 2022

Pith/arXiv arXiv 2022

[29] [29]

Simple and controllable music gener- ation,

J. Copet, F. Kreuk, I. Gat, T. Remez, D. Kant, G. Synnaeve, Y . Adi, and A. D´efossez, “Simple and controllable music gener- ation,”Advances in Neural Information Processing Systems, pp. 47 704–47 720, 2023

2023

[30] [30]

Natural language guidance of high- fidelity text-to-speech with synthetic annotations,

D. Lyth and S. King, “Natural language guidance of high- fidelity text-to-speech with synthetic annotations,”arXiv preprint arXiv:2402.01912, 2024

arXiv 2024

[31] [31]

Libritts-r: A restored multi-speaker text-to-speech corpus,

Y . Koizumi, H. Zen, S. Karita, Y . Ding, K. Yatabe, N. Morioka, M. Bacchiani, Y . Zhang, W. Han, and A. Bapna, “Libritts-r: A restored multi-speaker text-to-speech corpus,”arXiv preprint arXiv:2305.18802, 2023

arXiv 2023

[32] [32]

Emotional voice con- version: Theory, databases and esd,

K. Zhou, B. Sisman, R. Liu, and H. Li, “Emotional voice con- version: Theory, databases and esd,”Speech Communication, pp. 1–18, 2022

2022

[33] [33]

Libritts-p: A corpus with speaking style and speaker identity prompts for text-to-speech and style captioning,

M. Kawamura, R. Yamamoto, Y . Shirahata, T. Hasumi, and K. Tachibana, “Libritts-p: A corpus with speaking style and speaker identity prompts for text-to-speech and style captioning,” arXiv preprint arXiv:2406.07969, 2024

arXiv 2024

[34] [34]

Textrolspeech: A text style control speech corpus with codec language text-to-speech models,

S. Ji, J. Zuo, M. Fang, Z. Jiang, F. Chen, X. Duan, B. Huai, and Z. Zhao, “Textrolspeech: A text style control speech corpus with codec language text-to-speech models,” inICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2024, pp. 10 301–10 305

2024

[35] [35]

Mls: A large-scale multilingual dataset for speech research,

V . Pratap, Q. Xu, A. Sriram, G. Synnaeve, and R. Collobert, “Mls: A large-scale multilingual dataset for speech research,” arXiv preprint arXiv:2012.03411, 2020

Pith/arXiv arXiv 2012

[36] [36]

Librivox: Free public domain audiobooks,

J. Kearns, “Librivox: Free public domain audiobooks,”Reference Reviews, pp. 7–8, 2014

2014

[37] [37]

Emovoice: Llm-based emo- tional text-to-speech model with freestyle text prompting,

G. Yang, C. Yang, Q. Chen, Z. Ma, W. Chen, W. Wang, T. Wang, Y . Yang, Z. Niu, W. Liuet al., “Emovoice: Llm-based emo- tional text-to-speech model with freestyle text prompting,”arXiv preprint arXiv:2504.12867, 2025

arXiv 2025