arxiv: 2412.10117 · v3 · submitted 2024-12-13 · 💻 cs.SD · cs.AI· cs.LG· eess.AS

Recognition: 3 theorem links

· Lean Theorem

CosyVoice 2: Scalable Streaming Speech Synthesis with Large Language Models

Zhihao Du , Yuxuan Wang , Qian Chen , Xian Shi , Xiang Lv , Tianyu Zhao , Zhifu Gao , Yexin Yang

show 11 more authors

Changfeng Gao Hui Wang Fan Yu Huadai Liu Zhengyan Sheng Yue Gu Chong Deng Wen Wang Shiliang Zhang Zhijie Yan Jingren Zhou

Authors on Pith no claims yet

Pith reviewed 2026-05-13 06:09 UTC · model grok-4.3

classification 💻 cs.SD cs.AIcs.LGeess.AS

keywords speech synthesisstreaming TTSlarge language modelsflow matchingmultilingual speechfinite scalar quantizationcausal modelingvoice cloning

0 comments

The pith

CosyVoice 2 reaches human-parity naturalness and near-zero latency in streaming speech synthesis via LLM optimizations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents CosyVoice 2 as an updated multilingual speech synthesis system built on large language models. It adds finite-scalar quantization to raise speech token efficiency, streamlines the core model so a pre-trained LLM can serve directly as backbone, and introduces chunk-aware causal flow matching that runs both streaming and non-streaming modes inside one network. Trained on large-scale multilingual data, the system is reported to match human naturalness while keeping response latency minimal and synthesis quality virtually lossless even in real-time streaming. A reader would care because these changes target the latency and quality barriers that currently limit voice interfaces in live applications.

Core claim

CosyVoice 2 incorporates finite-scalar quantization to improve codebook utilization of speech tokens, streamlines the text-speech language model to allow direct use of a pre-trained LLM as backbone, and develops a chunk-aware causal flow matching model to support streaming and non-streaming synthesis within a single model. By training on a large-scale multilingual dataset, CosyVoice 2 achieves human-parity naturalness, minimal response latency, and virtually lossless synthesis quality in the streaming mode.

What carries the argument

Chunk-aware causal flow matching model, which processes audio in chunks to enable streaming while preserving full quality alongside finite-scalar quantization for improved token efficiency.

Load-bearing premise

The listed changes in quantization, architecture streamlining, and causal flow matching are what produce the human-parity naturalness and lossless streaming results.

What would settle it

A controlled listening test in which raters score CosyVoice 2 streaming outputs against matched human recordings on naturalness and intelligibility, with average scores falling measurably below human parity.

read the original abstract

In our previous work, we introduced CosyVoice, a multilingual speech synthesis model based on supervised discrete speech tokens. By employing progressive semantic decoding with two popular generative models, language models (LMs) and Flow Matching, CosyVoice demonstrated high prosody naturalness, content consistency, and speaker similarity in speech in-context learning. Recently, significant progress has been made in multi-modal large language models (LLMs), where the response latency and real-time factor of speech synthesis play a crucial role in the interactive experience. Therefore, in this report, we present an improved streaming speech synthesis model, CosyVoice 2, which incorporates comprehensive and systematic optimizations. Specifically, we introduce finite-scalar quantization to improve the codebook utilization of speech tokens. For the text-speech LM, we streamline the model architecture to allow direct use of a pre-trained LLM as the backbone. In addition, we develop a chunk-aware causal flow matching model to support various synthesis scenarios, enabling both streaming and non-streaming synthesis within a single model. By training on a large-scale multilingual dataset, CosyVoice 2 achieves human-parity naturalness, minimal response latency, and virtually lossless synthesis quality in the streaming mode. We invite readers to listen to the demos at https://funaudiollm.github.io/cosyvoice2.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

CosyVoice 2 refines the prior model with finite-scalar quantization and chunk-aware causal flow matching to enable streaming, but the human-parity claims need the actual metrics to hold weight.

read the letter

The main thing here is that CosyVoice 2 takes the original CosyVoice setup and adds streaming support through a chunk-aware causal flow matching model plus finite-scalar quantization to get better use out of the speech tokens. They also simplify the text-to-speech language model so a pre-trained LLM can serve as the backbone directly. These are practical engineering moves aimed at lower latency in interactive settings while keeping one model that handles both streaming and non-streaming cases. The large-scale multilingual training data is a reasonable step to improve consistency across languages. The paper lays out the architecture changes clearly enough that someone could follow the high-level recipe. The single-model design for different modes is a useful detail if it works without extra overhead. The soft spots are the performance assertions. The text states human-parity naturalness, minimal response latency, and virtually lossless streaming quality, yet the summary supplies no MOS scores, latency tables, baseline comparisons, or ablation results on the new pieces. Without those numbers it is difficult to judge how much the quantization and flow-matching changes actually deliver versus just more data. The free parameters around chunk size and quantization levels are noted but not tested in the description. This paper is for engineers building real-time voice systems who want low-latency TTS ideas. A reader already working with LLM-based audio models would pick up the token efficiency and causality tweaks. It has enough concrete components to merit peer review rather than a desk reject, provided the full manuscript adds the missing quantitative evaluations and error analysis. I would send it out for referees with a request to front-load the metrics and ablations.

Referee Report

2 major / 1 minor

Summary. The manuscript presents CosyVoice 2, an improved version of the prior CosyVoice model for multilingual speech synthesis. It incorporates finite-scalar quantization to enhance speech token codebook utilization, streamlines the text-speech language model to directly leverage a pre-trained LLM backbone, and introduces a chunk-aware causal flow matching model that supports both streaming and non-streaming synthesis in one architecture. Trained on a large-scale multilingual dataset, the work claims human-parity naturalness, minimal response latency, and virtually lossless quality specifically in streaming mode.

Significance. If the performance claims are substantiated, the work would represent a practical advance in low-latency, high-fidelity streaming TTS for interactive multimodal LLM applications, particularly by unifying streaming and non-streaming capabilities and improving token efficiency through the listed optimizations.

major comments (2)

[Abstract] Abstract: The central claims of 'human-parity naturalness,' 'minimal response latency,' and 'virtually lossless synthesis quality' in streaming mode are asserted without any quantitative metrics, objective/subjective scores, baseline comparisons, ablation studies, or error analysis. This absence directly undermines evaluation of whether the finite-scalar quantization, streamlined LLM backbone, or chunk-aware causal flow matching produce the stated gains.
[Architecture and Training sections] Architecture and Training sections: The descriptions of the three optimizations remain high-level narrative without equations, complexity analysis, or controlled experiments showing how each change (e.g., scalar quantization levels or chunk causality constraints) causally improves the reported metrics over the original CosyVoice.

minor comments (1)

[Abstract] Abstract: The demo link is useful; however, the text should clarify the exact definition of 'virtually lossless' (e.g., with respect to which reference signal or perceptual metric).

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our work. We address each of the major comments below and indicate the revisions we plan to make.

read point-by-point responses

Referee: [Abstract] Abstract: The central claims of 'human-parity naturalness,' 'minimal response latency,' and 'virtually lossless synthesis quality' in streaming mode are asserted without any quantitative metrics, objective/subjective scores, baseline comparisons, ablation studies, or error analysis. This absence directly undermines evaluation of whether the finite-scalar quantization, streamlined LLM backbone, or chunk-aware causal flow matching produce the stated gains.

Authors: We acknowledge that the abstract does not contain specific numerical values, which is typical for abstracts to remain concise. The full paper contains extensive evaluation results in the Experiments section, including objective metrics like word error rate, speaker similarity scores, mean opinion scores (MOS) for naturalness, response latency measurements, comparisons against multiple baselines, and ablation studies on the individual components. To strengthen the abstract, we will add a few key quantitative results, such as the achieved MOS scores and latency values, to better substantiate the claims. revision: yes
Referee: [Architecture and Training sections] Architecture and Training sections: The descriptions of the three optimizations remain high-level narrative without equations, complexity analysis, or controlled experiments showing how each change (e.g., scalar quantization levels or chunk causality constraints) causally improves the reported metrics over the original CosyVoice.

Authors: The current descriptions aim to provide an accessible overview. We agree that adding more technical depth would be beneficial. In the revision, we will introduce equations for finite-scalar quantization, including the specific quantization levels and how they enhance codebook utilization compared to the previous approach. For the streamlined LLM, we will include details on the architecture modifications, parameter counts, and a complexity analysis. For the chunk-aware causal flow matching, we will provide the formulation of the causal mechanism and chunk processing. Furthermore, we will enhance the ablation studies to more clearly demonstrate the contribution of each optimization through controlled comparisons to the original CosyVoice model, showing improvements in the relevant metrics. revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper describes an engineering progression from prior CosyVoice work through new optimizations (finite-scalar quantization, streamlined LLM backbone, chunk-aware causal flow matching) trained on large-scale multilingual data, with performance claims resting on empirical results rather than any closed-form derivation. No equations, fitted parameters renamed as predictions, or self-citation chains reduce the central claims to inputs by construction; the self-reference to previous work is purely contextual and not load-bearing for the reported human-parity or lossless outcomes.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The central claim rests on standard supervised learning assumptions plus several unstated modeling choices typical of LLM-based TTS; no new physical axioms or invented entities are introduced.

free parameters (2)

quantization codebook size and scalar levels
Finite-scalar quantization parameters are chosen to improve utilization; exact values not given in abstract but required for the token efficiency claim.
chunk size and causality constraints
Hyperparameters controlling streaming chunk length and causal masking are fitted or tuned to achieve low latency.

axioms (2)

domain assumption Discrete speech tokens from supervised training capture sufficient prosody and content for high-quality synthesis
Invoked when stating that progressive semantic decoding yields human-parity naturalness.
domain assumption Pre-trained LLM weights transfer effectively to text-to-speech token prediction without major retraining
Used when streamlining the model to directly employ a pre-trained LLM backbone.

pith-pipeline@v0.9.0 · 5607 in / 1461 out tokens · 61062 ms · 2026-05-13T06:09:01.262029+00:00 · methodology

discussion (0)

Forward citations

Cited by 28 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

AffectCodec: Emotion-Preserving Neural Speech Codec for Expressive Speech Modeling
cs.SD 2026-05 unverdicted novelty 7.0

AffectCodec is an emotion-guided neural speech codec that preserves emotional cues during quantization while maintaining semantic fidelity and prosodic naturalness.
How Should LLMs Listen While Speaking? A Study of User-Stream Routing in Full-Duplex Spoken Dialogue
cs.CL 2026-05 unverdicted novelty 7.0

Channel fusion gives better semantic grounding and QA performance in full-duplex LLM dialogue but is vulnerable to context corruption during interruptions, while cross-attention routing is more robust at the cost of w...
Kinetic-Optimal Scheduling with Moment Correction for Metric-Induced Discrete Flow Matching in Zero-Shot Text-to-Speech
eess.AS 2026-05 unverdicted novelty 7.0

GibbsTTS combines a training-free kinetic-optimal scheduler with finite-step moment correction in MI-DFM to deliver top naturalness and strong speaker similarity in zero-shot TTS.
VITA-QinYu: Expressive Spoken Language Model for Role-Playing and Singing
cs.CL 2026-05 unverdicted novelty 7.0

VITA-QinYu is the first expressive end-to-end spoken language model supporting role-playing and singing alongside conversation, trained on 15.8K hours of data and outperforming prior models on expressiveness and conve...
Tibetan-TTS:Low-Resource Tibetan Speech Synthesis with Large Model Adaptation
cs.SD 2026-05 unverdicted novelty 7.0

Large-model adaptation with Tibetan text handling produces natural speech from limited data, outperforming commercial systems.
MINT-Bench: A Comprehensive Multilingual Benchmark for Instruction-Following Text-to-Speech
eess.AS 2026-04 unverdicted novelty 7.0

MINT-Bench is a new benchmark using hierarchical taxonomy, multi-stage data pipeline, and hybrid evaluation to assess instruction-following TTS systems, revealing major gaps in compositional and paralinguistic controls.
NVBench: A Benchmark for Speech Synthesis with Non-Verbal Vocalizations
cs.SD 2026-04 unverdicted novelty 7.0

NVBench provides a standardized bilingual benchmark and evaluation protocol for assessing non-verbal vocalization generation, placement, and salience in text-to-speech systems.
AST: Adaptive, Seamless, and Training-Free Precise Speech Editing
cs.SD 2026-04 unverdicted novelty 7.0

AST enables seamless speech editing by latent recomposition on pre-trained TTS models plus adaptive weak fact guidance, plus a new dataset and WDTW metric, claiming 70% WER reduction and better temporal consistency wi...
CoSyncDiT: Cognitive Synchronous Diffusion Transformer for Movie Dubbing
cs.SD 2026-04 unverdicted novelty 7.0

CoSyncDiT is a cognitive-inspired diffusion transformer that achieves state-of-the-art lip synchronization and naturalness in movie dubbing by guiding noise-to-speech generation through acoustic, visual, and contextua...
CapTalk: Unified Voice Design for Single-Utterance and Dialogue Speech Generation
cs.SD 2026-04 unverdicted novelty 7.0

CapTalk unifies single-utterance and dialogue voice design via utterance- and speaker-level captions plus a hierarchical variational module for stable timbre with adaptive expression.
The Silent Thought: Modeling Internal Cognition in Full-Duplex Spoken Dialogue Models via Latent Reasoning
eess.AS 2026-03 unverdicted novelty 7.0

FLAIR enables spoken dialogue AI to conduct continuous latent reasoning while perceiving speech through recursive latent embeddings and an ELBO-based finetuning objective.
AuDirector: A Self-Reflective Closed-Loop Framework for Immersive Audio Storytelling
cs.SD 2026-05 unverdicted novelty 6.0

AuDirector is a self-reflective closed-loop multi-agent framework that generates immersive audio narratives with improved structural coherence, emotional expressiveness, and acoustic fidelity via identity-aware voice ...
Reducing Linguistic Hallucination in LM-Based Speech Enhancement via Noise-Invariant Acoustic-Semantic Distillation
eess.AS 2026-05 unverdicted novelty 6.0

L3-SE reduces linguistic hallucination in LM-based speech enhancement by distilling noise-invariant acoustic-semantic representations from noisy inputs to condition an autoregressive decoder-only language model.
CASCADE: Context-Aware Relaxation for Speculative Image Decoding
cs.CV 2026-05 unverdicted novelty 6.0

CASCADE formalizes semantic interchangeability and convergence in target model representations to enable context-aware acceptance relaxation in tree-based speculative decoding, delivering up to 3.6x speedup on text-to...
TTS-PRISM: A Perceptual Reasoning and Interpretable Speech Model for Fine-Grained Diagnosis
cs.CL 2026-04 unverdicted novelty 6.0

TTS-PRISM defines a 12-dimensional perceptual schema, builds a targeted diagnostic dataset via adversarial synthesis and expert labels, and tunes an end-to-end model that outperforms generalist LLMs in human alignment...
Why Your Tokenizer Fails in Information Fusion: A Timing-Aware Pre-Quantization Fusion for Video-Enhanced Audio Tokenization
eess.AS 2026-04 unverdicted novelty 6.0

A timing-aware pre-quantization fusion approach integrates visual cues into audio tokenizers along the temporal axis, maintaining reconstruction quality while outperforming audio-only and prior multimodal baselines on...
Bridging What the Model Thinks and How It Speaks: Self-Aware Speech Language Models for Expressive Speech Generation
cs.CL 2026-04 unverdicted novelty 6.0

SA-SLM uses variational information bottleneck for intent-aware bridging and self-criticism for realization-aware alignment to close the semantic-acoustic gap, outperforming open-source models and nearing GPT-4o-Audio...
ASPIRin: Action Space Projection for Interactivity-Optimized Reinforcement Learning in Full-Duplex Speech Language Models
cs.CL 2026-04 unverdicted novelty 6.0

ASPIRin decouples speaking timing from token content via binary action space projection and applies GRPO with rule-based rewards to optimize interactivity in SLMs without semantic collapse or repetition.
OmniVoice: Towards Omnilingual Zero-Shot Text-to-Speech with Diffusion Language Models
cs.CL 2026-04 unverdicted novelty 6.0

OmniVoice introduces a diffusion language model-style non-autoregressive TTS system that directly maps text to multi-codebook acoustic tokens, scaling zero-shot synthesis to over 600 languages with SOTA results on mul...
FoleyDirector: Fine-Grained Temporal Steering for Video-to-Audio Generation via Structured Scripts
cs.SD 2026-03 unverdicted novelty 6.0

FoleyDirector introduces structured temporal scripts and a fusion module to enable precise timing control in DiT-based video-to-audio generation while preserving audio fidelity.
Borderless Long Speech Synthesis
cs.SD 2026-03 unverdicted novelty 6.0

Borderless Long Speech Synthesis unifies voice design, multi-speaker TTS, and long-form generation via Global-Sentence-Token annotations, CoT reasoning, and a Structured Semantic Interface for agent-centric control.
Qwen3-Omni Technical Report
cs.CL 2025-09 unverdicted novelty 6.0

Qwen3-Omni is a unified multimodal model that achieves open-source SOTA on 32 of 36 audio and audio-visual benchmarks and overall SOTA on 22 without degrading performance on text, image, or video relative to single-mo...
Sema: Semantic Transport for Real-Time Multimodal Agents
cs.MM 2026-04 unverdicted novelty 5.0

Sema reduces uplink bandwidth by 64x for audio and 130-210x for screenshots while keeping multimodal agent task accuracy within 0.7 percentage points of raw baselines in WAN simulations.
Qwen3.5-Omni Technical Report
cs.CL 2026-04 unverdicted novelty 5.0

Qwen3.5-Omni scales an omnimodal model to hundreds of billions of parameters with 256k context, introduces ARIA for stable speech synthesis, and reports SOTA performance on 215 audio-visual benchmarks while adding mul...
Controllable Singing Style Conversion with Boundary-Aware Information Bottleneck
cs.SD 2026-04 unverdicted novelty 5.0

A singing voice conversion system with boundary-aware information bottleneck and high-frequency augmentation achieves the best naturalness in SVCC2025 subjective tests while using less extra data than competitors.
WAND: Windowed Attention and Knowledge Distillation for Efficient Autoregressive Text-to-Speech Models
cs.CL 2026-03 unverdicted novelty 5.0

WAND adapts AR-TTS models to constant complexity via windowed attention and distillation, cutting KV cache memory by up to 66.2% while preserving quality and achieving length-invariant latency.
Qwen2.5-Omni Technical Report
cs.CL 2025-03 conditional novelty 5.0

Qwen2.5-Omni presents a multimodal model with block-wise encoders, TMRoPE position embeddings, and a Thinker-Talker architecture that enables simultaneous text and streaming speech generation while matching text perfo...
Empowering Video Translation using Multimodal Large Language Models
cs.CV 2026-04 unverdicted novelty 4.0

The paper offers the first focused review of MLLM-based video translation organized by a three-role taxonomy of Semantic Reasoner, Expressive Performer, and Visual Synthesizer, plus open challenges.

Reference graph

Works this paper leans on

60 extracted references · 60 canonical work pages · cited by 28 Pith papers · 4 internal anchors

[1]

Yuxuan Wang, R. J. Skerry-Ryan, Daisy Stanton, Yonghui Wu, Ron J. Weiss, Navdeep Jaitly, Zongheng Yang, Ying Xiao, Zhifeng Chen, Samy Bengio, Quoc V . Le, Yannis Agiomyrgian- nakis, Rob Clark, and Rif A. Saurous. Tacotron: Towards end-to-end speech synthesis. In INTERSPEECH, pages 4006–4010. ISCA, 2017

work page 2017
[2]

Weiss, Mike Schuster, Navdeep Jaitly, Zongheng Yang, Zhifeng Chen, Yu Zhang, Yuxuan Wang, R

Jonathan Shen, Ruoming Pang, Ron J. Weiss, Mike Schuster, Navdeep Jaitly, Zongheng Yang, Zhifeng Chen, Yu Zhang, Yuxuan Wang, R. J. Skerry-Ryan, Rif A. Saurous, Yannis Agiomyr- giannakis, and Yonghui Wu. Natural TTS synthesis by conditioning wavenet on MEL spectro- gram predictions. In ICASSP, pages 4779–4783. IEEE, 2018

work page 2018
[3]

Deep voice 3: 2000-speaker neural text-to-speech

Wei Ping, Kainan Peng, Andrew Gibiansky, Sercan ¨Omer Arik, Ajay Kannan, Sharan Narang, Jonathan Raiman, and John Miller. Deep voice 3: 2000-speaker neural text-to-speech. CoRR, abs/1710.07654, 2017

work page arXiv 2000
[4]

Clarinet: Parallel wave generation in end-to-end text-to-speech

Wei Ping, Kainan Peng, and Jitong Chen. Clarinet: Parallel wave generation in end-to-end text-to-speech. In ICLR (Poster). OpenReview.net, 2019

work page 2019
[5]

Fast- speech: Fast, robust and controllable text to speech

Yi Ren, Yangjun Ruan, Xu Tan, Tao Qin, Sheng Zhao, Zhou Zhao, and Tie-Yan Liu. Fast- speech: Fast, robust and controllable text to speech. In NeurIPS, pages 3165–3174, 2019. 15

work page 2019
[6]

Neural speech synthesis with transformer network

Naihan Li, Shujie Liu, Yanqing Liu, Sheng Zhao, and Ming Liu. Neural speech synthesis with transformer network. In AAAI, pages 6706–6713. AAAI Press, 2019

work page 2019
[7]

Fastspeech 2: Fast and high-quality end-to-end text to speech

Yi Ren, Chenxu Hu, Xu Tan, Tao Qin, Sheng Zhao, Zhou Zhao, and Tie-Yan Liu. Fastspeech 2: Fast and high-quality end-to-end text to speech. In ICLR. OpenReview.net, 2021

work page 2021
[8]

Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers

Chengyi Wang, Sanyuan Chen, Yu Wu, Ziqiang Zhang, Long Zhou, Shujie Liu, Zhuo Chen, Yanqing Liu, Huaming Wang, Jinyu Li, Lei He, Sheng Zhao, and Furu Wei. Neural codec language models are zero-shot text to speech synthesizers. CoRR, abs/2301.02111, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[9]

Soundstream: An end-to-end neural audio codec

Neil Zeghidour, Alejandro Luebs, Ahmed Omran, Jan Skoglund, and Marco Tagliasacchi. Soundstream: An end-to-end neural audio codec. IEEE ACM Trans. Audio Speech Lang. Process., 30:495–507, 2022

work page 2022
[10]

High fidelity neural audio compression

Alexandre D ´efossez, Jade Copet, Gabriel Synnaeve, and Yossi Adi. High fidelity neural audio compression. Trans. Mach. Learn. Res., 2023, 2023

work page 2023
[11]

Funcodec: A fundamental, reproducible and integrable open-source toolkit for neural speech codec

Zhihao Du, Shiliang Zhang, Kai Hu, and Siqi Zheng. Funcodec: A fundamental, reproducible and integrable open-source toolkit for neural speech codec. In ICASSP, pages 591–595. IEEE, 2024

work page 2024
[12]

Speak, read and prompt: High-fidelity text-to-speech with minimal supervision

Eugene Kharitonov, Damien Vincent, Zal ´an Borsos, Rapha¨el Marinier, Sertan Girgin, Olivier Pietquin, Matt Sharifi, Marco Tagliasacchi, and Neil Zeghidour. Speak, read and prompt: High-fidelity text-to-speech with minimal supervision. Trans. Assoc. Comput. Linguistics , 11:1703–1718, 2023

work page 2023
[13]

ELLA-V: stable neural codec language modeling with alignment-guided sequence reordering.CoRR, abs/2401.07333, 2024

Yakun Song, Zhuo Chen, Xiaofei Wang, Ziyang Ma, and Xie Chen. ELLA-V: stable neural codec language modeling with alignment-guided sequence reordering.CoRR, abs/2401.07333, 2024

work page arXiv 2024
[14]

V ALL-T: decoder-only generative transducer for robust and decoding- controllable text-to-speech

Chenpeng Du, Yiwei Guo, Hankun Wang, Yifan Yang, Zhikang Niu, Shuai Wang, Hui Zhang, Xie Chen, and Kai Yu. V ALL-T: decoder-only generative transducer for robust and decoding- controllable text-to-speech. CoRR, abs/2401.14321, 2024

work page arXiv 2024
[15]

RALL-E: robust codec language modeling with chain-of-thought prompting for text-to-speech synthesis.CoRR, abs/2404.03204, 2024

Detai Xin, Xu Tan, Kai Shen, Zeqian Ju, Dongchao Yang, Yuancheng Wang, Shinnosuke Takamichi, Hiroshi Saruwatari, Shujie Liu, Jinyu Li, and Sheng Zhao. RALL-E: robust codec language modeling with chain-of-thought prompting for text-to-speech synthesis.CoRR, abs/2404.03204, 2024

work page arXiv 2024
[16]

V ALL-E 2: Neural codec language models are human parity zero-shot text to speech synthesizers

Sanyuan Chen, Shujie Liu, Long Zhou, Yanqing Liu, Xu Tan, Jinyu Li, Sheng Zhao, Yao Qian, and Furu Wei. V ALL-E 2: Neural codec language models are human parity zero-shot text to speech synthesizers. CoRR, abs/2406.05370, 2024

work page arXiv 2024
[17]

Vall-e r: Robust and efficient zero-shot text- to-speech synthesis via monotonic alignment.arXiv preprint arXiv:2406.07855, 2024

Bing Han, Long Zhou, Shujie Liu, Sanyuan Chen, Lingwei Meng, Yanming Qian, Yanqing Liu, Sheng Zhao, Jinyu Li, and Furu Wei. V ALL-E R: robust and efficient zero-shot text-to- speech synthesis via monotonic alignment. CoRR, abs/2406.07855, 2024

work page arXiv 2024
[18]

Maskgct: Zero-shot text-to-speech with masked generative codec transformer

Yuancheng Wang, Haoyue Zhan, Liwei Liu, Ruihong Zeng, Haotian Guo, Jiachen Zheng, Qiang Zhang, Shunsi Zhang, and Zhizheng Wu. Maskgct: Zero-shot text-to-speech with masked generative codec transformer. CoRR, abs/2409.00750, 2024

work page arXiv 2024
[19]

Wavenext: Convnext-based fast neural vocoder without ISTFT layer

Takuma Okamoto, Haruki Yamashita, Yamato Ohtani, Tomoki Toda, and Hisashi Kawai. Wavenext: Convnext-based fast neural vocoder without ISTFT layer. In ASRU, pages 1–8. IEEE, 2023

work page 2023
[20]

V ocos: Closing the gap between time-domain and fourier-based neural vocoders for high-quality audio synthesis

Hubert Siuzdak. V ocos: Closing the gap between time-domain and fourier-based neural vocoders for high-quality audio synthesis. In ICLR. OpenReview.net, 2024

work page 2024
[21]

Autoregressive speech synthesis without vector quantization

Lingwei Meng, Long Zhou, Shujie Liu, Sanyuan Chen, Bing Han, Shujie Hu, Yanqing Liu, Jinyu Li, Sheng Zhao, Xixin Wu, Helen Meng, and Furu Wei. Autoregressive speech synthesis without vector quantization. CoRR, abs/2407.08551, 2024

work page arXiv 2024
[22]

Denoising diffusion probabilistic models

Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. In Hugo Larochelle, Marc’Aurelio Ranzato, Raia Hadsell, Maria-Florina Balcan, and Hsuan-Tien Lin, editors, Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual , 2020

work page 2020
[23]

Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole

Yang Song, Jascha Sohl-Dickstein, Diederik P. Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. In ICLR. OpenReview.net, 2021. 16

work page 2021
[24]

Yaron Lipman, Ricky T. Q. Chen, Heli Ben-Hamu, Maximilian Nickel, and Matthew Le. Flow matching for generative modeling. In ICLR. OpenReview.net, 2023

work page 2023
[25]

V oicebox: Text- guided multilingual universal speech generation at scale

Matthew Le, Apoorv Vyas, Bowen Shi, Brian Karrer, Leda Sari, Rashel Moritz, Mary Williamson, Vimal Manohar, Yossi Adi, Jay Mahadeokar, and Wei-Ning Hsu. V oicebox: Text- guided multilingual universal speech generation at scale. In NeurIPS, 2023

work page 2023
[26]

Naturalspeech 3: Zero-shot speech synthesis with factorized codec and diffusion models

Zeqian Ju, Yuancheng Wang, Kai Shen, Xu Tan, Detai Xin, Dongchao Yang, Eric Liu, Yichong Leng, Kaitao Song, Siliang Tang, Zhizheng Wu, Tao Qin, Xiangyang Li, Wei Ye, Shikun Zhang, Jiang Bian, Lei He, Jinyu Li, and Sheng Zhao. Naturalspeech 3: Zero-shot speech synthesis with factorized codec and diffusion models. In ICML. OpenReview.net, 2024

work page 2024
[27]

V oiceflow: Efficient text-to- speech with rectified flow matching

Yiwei Guo, Chenpeng Du, Ziyang Ma, Xie Chen, and Kai Yu. V oiceflow: Efficient text-to- speech with rectified flow matching. In ICASSP, pages 11121–11125. IEEE, 2024

work page 2024
[28]

Matcha-tts: A fast TTS architecture with conditional flow matching

Shivam Mehta, Ruibo Tu, Jonas Beskow, ´Eva Sz´ekely, and Gustav Eje Henter. Matcha-tts: A fast TTS architecture with conditional flow matching. In ICASSP, pages 11341–11345. IEEE, 2024

work page 2024
[29]

E3 TTS: easy end-to-end diffusion-based text to speech

Yuan Gao, Nobuyuki Morioka, Yu Zhang, and Nanxin Chen. E3 TTS: easy end-to-end diffusion-based text to speech. In ASRU, pages 1–8. IEEE, 2023

work page 2023
[30]

Ditto-tts: Efficient and scalable zero-shot text-to-speech with diffusion transformer

Keon Lee, Dong Won Kim, Jaehyeon Kim, and Jaewoong Cho. Ditto-tts: Efficient and scalable zero-shot text-to-speech with diffusion transformer. CoRR, abs/2406.11427, 2024

work page arXiv 2024
[31]

E2 TTS: embarrassingly easy fully non-autoregressive zero-shot TTS

Sefik Emre Eskimez, Xiaofei Wang, Manthan Thakker, Canrun Li, Chung-Hsien Tsai, Zhen Xiao, Hemin Yang, Zirun Zhu, Min Tang, Xu Tan, Yanqing Liu, Sheng Zhao, and Naoyuki Kanda. E2 TTS: embarrassingly easy fully non-autoregressive zero-shot TTS. CoRR, abs/2406.18009, 2024

work page arXiv 2024
[32]

F5-TTS: A fairytaler that fakes fluent and faithful speech with flow matching,

Yushen Chen, Zhikang Niu, Ziyang Ma, Keqi Deng, Chunhui Wang, Jian Zhao, Kai Yu, and Xie Chen. F5-TTS: A fairytaler that fakes fluent and faithful speech with flow matching. CoRR, abs/2410.06885, 2024

work page arXiv 2024
[33]

Seed-TTS: A Family of High-Quality Versatile Speech Generation Models

Philip Anastassiou, Jiawei Chen, Jitong Chen, Yuanzhe Chen, Zhuo Chen, Ziyi Chen, Jian Cong, Lelai Deng, Chuang Ding, Lu Gao, Mingqing Gong, Peisong Huang, Qingqing Huang, Zhiying Huang, Yuanyuan Huo, Dongya Jia, Chumin Li, Feiya Li, Hui Li, Jiaxin Li, Xiaoyang Li, Xingxing Li, Lin Liu, Shouda Liu, Sichao Liu, Xudong Liu, Yuchen Liu, Zhengxi Liu, Lu Lu, J...

work page internal anchor Pith review arXiv 2024
[34]

Cosyvoice: A scalable multilingual zero-shot text- to-speech synthesizer based on supervised semantic tokens

Zhihao Du, Qian Chen, Shiliang Zhang, Kai Hu, Heng Lu, Yexin Yang, Hangrui Hu, Siqi Zheng, Yue Gu, Ziyang Ma, Zhifu Gao, and Zhijie Yan. Cosyvoice: A scalable multi- lingual zero-shot text-to-speech synthesizer based on supervised semantic tokens. CoRR, abs/2407.05407, 2024

work page arXiv 2024
[35]

Fireredtts: A foundation text-to-speech framework for industry-level generative speech applications.arXiv preprint arXiv:2409.03283, 2024

Haohan Guo, Kun Liu, Feiyu Shen, Yi-Chen Wu, Feng-Long Xie, Kun Xie, and Kaituo Xu. Fireredtts: A foundation text-to-speech framework for industry-level generative speech appli- cations. CoRR, abs/2409.03283, 2024

work page arXiv 2024
[36]

GPT-4o System Card

Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card. arXiv preprint arXiv:2410.21276, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[37]

Speechgpt: Empowering large language models with intrinsic cross-modal conversational abil- ities

Dong Zhang, Shimin Li, Xin Zhang, Jun Zhan, Pengyu Wang, Yaqian Zhou, and Xipeng Qiu. Speechgpt: Empowering large language models with intrinsic cross-modal conversational abil- ities. In EMNLP (Findings), pages 15757–15773. Association for Computational Linguistics, 2023

work page 2023
[38]

Tran, and Kazuhito Koishida

Trung Dang, David Aponte, Dung N. Tran, and Kazuhito Koishida. Livespeech: Low- latency zero-shot text-to-speech via autoregressive modeling of audio discrete codes. CoRR, abs/2406.02897, 2024

work page arXiv 2024
[39]

Tran, Tianyi Chen, and Kazuhito Koishida

Trung Dang, David Aponte, Dung N. Tran, Tianyi Chen, and Kazuhito Koishida. Zero-shot text-to-speech from continuous text streams. CoRR, abs/2410.00767, 2024. 17

work page arXiv 2024
[40]

BASE TTS: Lessons from building a billion-parameter text-to-speech model on 100k hours of data

Mateusz Lajszczak, Guillermo C ´ambara, Yang Li, Fatih Beyhan, Arent van Korlaar, Fan Yang, Arnaud Joly, ´Alvaro Mart´ın-Cortinas, Ammar Abbas, Adam Michalski, Alexis Moinet, Sri Karlapati, Ewa Muszynska, Haohan Guo, Bartosz Putrycz, Soledad L ´opez Gambino, Kayeon Yoo, Elena Sokolova, and Thomas Drugman. BASE TTS: lessons from building a billion- paramet...

work page arXiv 2024
[41]

Speak while you think: Streaming speech synthesis during text generation

Avihu Dekel, Slava Shechtman, Raul Fernandez, David Haws, Zvi Kons, and Ron Hoory. Speak while you think: Streaming speech synthesis during text generation. In ICASSP, pages 11931–11935. IEEE, 2024

work page 2024
[42]

Finite scalar quantization: VQ-V AE made simple

Fabian Mentzer, David Minnen, Eirikur Agustsson, and Michael Tschannen. Finite scalar quantization: VQ-V AE made simple. InICLR. OpenReview.net, 2024

work page 2024
[43]

Funaudiollm: V oice understanding and generation foundation models for natural interaction between humans and llms

Tongyi Speech Team. Funaudiollm: V oice understanding and generation foundation models for natural interaction between humans and llms. arxiv, 2024

work page 2024
[44]

Jianlin Su, Murtadha H. M. Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. Ro- former: Enhanced transformer with rotary position embedding. Neurocomputing, 568:127063, 2024

work page 2024
[45]

Qwen2.5: A party of foundation models, September 2024

Qwen Team. Qwen2.5: A party of foundation models, September 2024

work page 2024
[46]

Classifier-Free Diffusion Guidance

Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[47]

Improved denoising diffusion probabilistic models

Alexander Quinn Nichol and Prafulla Dhariwal. Improved denoising diffusion probabilistic models. In International Conference on Machine Learning , pages 8162–8171. PMLR, 2021

work page 2021
[48]

V oicebox: Text-guided mul- tilingual universal speech generation at scale

Matthew Le, Apoorv Vyas, Bowen Shi, Brian Karrer, Leda Sari, Rashel Moritz, Mary Williamson, Vimal Manohar, Yossi Adi, Jay Mahadeokar, et al. V oicebox: Text-guided mul- tilingual universal speech generation at scale. Advances in neural information processing sys- tems, 36, 2024

work page 2024
[49]

Manning, Stefano Ermon, and Chelsea Finn

Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D. Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. In NeurIPS, 2023

work page 2023
[50]

Paraformer: Fast and accurate parallel transformer for non-autoregressive end-to-end speech recognition

Zhifu Gao, Shiliang Zhang, Ian McLoughlin, and Zhijie Yan. Paraformer: Fast and accurate parallel transformer for non-autoregressive end-to-end speech recognition. In Interspeech, pages 2063–2067. ISCA, 2022

work page 2063
[51]

Unicats: A unified context-aware text-to-speech framework with contextual vq-diffusion and vocoding

Chenpeng Du, Yiwei Guo, Feiyu Shen, Zhijun Liu, Zheng Liang, Xie Chen, Shuai Wang, Hui Zhang, and Kai Yu. Unicats: A unified context-aware text-to-speech framework with contextual vq-diffusion and vocoding. In AAAI, pages 17924–17932. AAAI Press, 2024

work page 2024
[52]

An enhanced res2net with local and global feature fusion for speaker verification

Yafeng Chen, Siqi Zheng, Hui Wang, Luyao Cheng, Qian Chen, and Jiajun Qi. An enhanced res2net with local and global feature fusion for speaker verification. In Interspeech. ISCA, 2023

work page 2023
[53]

Chandan K. A. Reddy, Vishak Gopal, and Ross Cutler. Dnsmos P.835: A non-intrusive percep- tual objective speech quality metric to evaluate noise suppressors. In ICASSP, pages 886–890. IEEE, 2022

work page 2022
[54]

Robust speech recognition via large-scale weak supervision

Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. Robust speech recognition via large-scale weak supervision. In Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett, editors, International Conference on Machine Learning, ICML 2023, 23-29 July 2023, Hon- olulu, Haw...

work page 2023
[55]

A large-scale evaluation of speech foundation models

Shu-wen Yang, Heng-Jui Chang, Zili Huang, Andy T Liu, Cheng-I Lai, Haibin Wu, Jiatong Shi, Xuankai Chang, Hsiang-Sheng Tsai, Wen-Chin Huang, et al. A large-scale evaluation of speech foundation models. IEEE/ACM Transactions on Audio, Speech, and Language Pro- cessing, 2024

work page 2024
[56]

2noise. Chattts. https://github.com/2noise/ChatTTS, 2024

work page 2024
[57]

Gpt-sovits

RVC-Boss. Gpt-sovits. https://github.com/RVC-Boss/GPT-SoVITS, 2024. 18

work page 2024
[58]

Openvoice: Versatile instant voice cloning

Zengyi Qin, Wenliang Zhao, Xumin Yu, and Xin Sun. Openvoice: Versatile instant voice cloning. CoRR, abs/2312.01479, 2023

work page arXiv 2023
[59]

Natural language guidance of high-fidelity text-to-speech with synthetic annotations

Daniel Lyth and Simon King. Natural language guidance of high-fidelity text-to-speech with synthetic annotations. CoRR, abs/2402.01912, 2024

work page arXiv 2024
[60]

Emotivoice

Netease Youdao. Emotivoice. https://github.com/netease-youdao/EmotiVoice, 2024. 19

work page 2024