pith. machine review for the scientific record.

arxiv: 2412.10117 · v3 · submitted 2024-12-13 · 💻 cs.SD · cs.AI · cs.LG · eess.AS

CosyVoice 2: Scalable Streaming Speech Synthesis with Large Language Models

Pith reviewed 2026-05-13 06:09 UTC · model grok-4.3

classification 💻 cs.SD · cs.AI · cs.LG · eess.AS
keywords speech synthesis, streaming TTS, large language models, flow matching, multilingual speech, finite scalar quantization, causal modeling, voice cloning
The pith

CosyVoice 2 reports human-parity naturalness and minimal response latency in streaming speech synthesis, attributed to three optimizations of its LLM-based pipeline.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents CosyVoice 2 as an updated multilingual speech synthesis system built on large language models. It adds finite-scalar quantization to raise speech token efficiency, streamlines the core model so a pre-trained LLM can serve directly as backbone, and introduces chunk-aware causal flow matching that runs both streaming and non-streaming modes inside one network. Trained on large-scale multilingual data, the system is reported to match human naturalness while keeping response latency minimal and synthesis quality virtually lossless even in real-time streaming. A reader would care because these changes target the latency and quality barriers that currently limit voice interfaces in live applications.

Core claim

CosyVoice 2 incorporates finite-scalar quantization to improve codebook utilization of speech tokens, streamlines the text-speech language model to allow direct use of a pre-trained LLM as backbone, and develops a chunk-aware causal flow matching model to support streaming and non-streaming synthesis within a single model. By training on a large-scale multilingual dataset, CosyVoice 2 achieves human-parity naturalness, minimal response latency, and virtually lossless synthesis quality in the streaming mode.
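The quantization step named in the claim can be illustrated with a minimal sketch of finite scalar quantization in its generic form: each latent dimension is bounded and rounded to a small fixed grid, and the per-dimension codes are combined into one token id. The level counts in `LEVELS` and the function names are assumptions made for illustration; this page does not state the values or code used in CosyVoice 2, and the straight-through gradient used during training is omitted.

```python
import numpy as np

# Hypothetical per-dimension level counts (not taken from the paper).
# Odd values keep each grid symmetric around zero.
LEVELS = np.array([3, 3, 3, 3])

def fsq_quantize(z, levels=LEVELS):
    """Bound each latent dimension with tanh, then round to the nearest grid point."""
    half = (levels - 1) / 2.0           # 3 levels -> grid {-1, 0, 1}
    bounded = np.tanh(z) * half         # values lie strictly inside (-half, half)
    return np.round(bounded).astype(int)

def fsq_index(codes, levels=LEVELS):
    """Combine per-dimension codes into a single token id (mixed-radix encoding)."""
    digits = codes + (levels - 1) // 2  # shift each code into 0..levels-1
    bases = np.concatenate(([1], np.cumprod(levels[:-1])))
    return np.sum(digits * bases, axis=-1)

if __name__ == "__main__":
    z = np.array([-5.0, 0.0, 0.2, 5.0])
    codes = fsq_quantize(z)
    print(codes, fsq_index(codes), "of", int(np.prod(LEVELS)), "possible tokens")
```

With these illustrative levels the implicit codebook has 3^4 = 81 entries, and no entry is unreachable by construction. Whether a trained encoder actually uses all of them is an empirical question that the paper's experiments would need to answer.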

What carries the argument

A chunk-aware causal flow matching model that processes audio in chunks, so a single network can serve both streaming and non-streaming synthesis with virtually lossless quality, paired with finite-scalar quantization for better codebook utilization of speech tokens.
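One generic way to let a single network cover both modes is to vary an attention mask by chunk size. The sketch below shows that idea only; the function name and the mask rule are assumptions for illustration, and this page does not describe the masks the paper actually uses.

```python
import numpy as np

def chunk_causal_mask(num_frames, chunk_size):
    """Return a boolean [num_frames, num_frames] mask where True means 'may attend'.

    Frame i may attend to frame j when j's chunk is not later than i's chunk,
    so frames see their whole current chunk plus all earlier chunks.
    """
    chunk_id = np.arange(num_frames) // chunk_size
    return chunk_id[None, :] <= chunk_id[:, None]

if __name__ == "__main__":
    # Streaming-style mask: 4 frames processed in chunks of 2.
    print(chunk_causal_mask(4, 2).astype(int))
    # Non-streaming: one chunk spans the utterance, so every frame sees every frame.
    print(chunk_causal_mask(4, 4).astype(int))
```

Under this rule, a chunk size of 1 reduces to an ordinary causal mask and a chunk size equal to the sequence length reduces to full attention, which is why one set of weights can in principle be run in either mode.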

Load-bearing premise

The listed changes in quantization, architecture streamlining, and causal flow matching are what produce the human-parity naturalness and lossless streaming results.

What would settle it

A controlled listening test in which raters score CosyVoice 2 streaming outputs against matched human recordings on naturalness and intelligibility, with average scores falling measurably below human parity.
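A decision rule for such a test could be sketched as a paired bootstrap over per-item rating differences. Everything here is a hypothetical illustration: the function name, the resampling settings, and the placeholder ratings are assumptions, not data or methodology from the paper.

```python
import random
import statistics

def bootstrap_mean_diff_ci(system, human, n_resamples=2000, alpha=0.05, seed=0):
    """Return (mean difference, CI low, CI high) for paired ratings system - human."""
    rng = random.Random(seed)
    diffs = [s - h for s, h in zip(system, human)]
    means = []
    for _ in range(n_resamples):
        # Resample the paired differences with replacement.
        sample = [rng.choice(diffs) for _ in diffs]
        means.append(statistics.fmean(sample))
    means.sort()
    low = means[int((alpha / 2) * n_resamples)]
    high = means[int((1 - alpha / 2) * n_resamples) - 1]
    return statistics.fmean(diffs), low, high

if __name__ == "__main__":
    # Placeholder ratings on a 1-5 scale, invented purely to exercise the function.
    system_scores = [4, 5, 4, 3, 4, 5, 4, 4]
    human_scores = [4, 5, 5, 4, 4, 5, 4, 5]
    mean_diff, low, high = bootstrap_mean_diff_ci(system_scores, human_scores)
    # A whole interval below zero would count against the human-parity claim.
    print(mean_diff, low, high)
```

An interval that includes zero would not by itself establish parity; it would only fail to show a gap at the chosen sample size.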

Original abstract

In our previous work, we introduced CosyVoice, a multilingual speech synthesis model based on supervised discrete speech tokens. By employing progressive semantic decoding with two popular generative models, language models (LMs) and Flow Matching, CosyVoice demonstrated high prosody naturalness, content consistency, and speaker similarity in speech in-context learning. Recently, significant progress has been made in multi-modal large language models (LLMs), where the response latency and real-time factor of speech synthesis play a crucial role in the interactive experience. Therefore, in this report, we present an improved streaming speech synthesis model, CosyVoice 2, which incorporates comprehensive and systematic optimizations. Specifically, we introduce finite-scalar quantization to improve the codebook utilization of speech tokens. For the text-speech LM, we streamline the model architecture to allow direct use of a pre-trained LLM as the backbone. In addition, we develop a chunk-aware causal flow matching model to support various synthesis scenarios, enabling both streaming and non-streaming synthesis within a single model. By training on a large-scale multilingual dataset, CosyVoice 2 achieves human-parity naturalness, minimal response latency, and virtually lossless synthesis quality in the streaming mode. We invite readers to listen to the demos at https://funaudiollm.github.io/cosyvoice2.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript presents CosyVoice 2, an improved version of the prior CosyVoice model for multilingual speech synthesis. It incorporates finite-scalar quantization to enhance speech token codebook utilization, streamlines the text-speech language model to directly leverage a pre-trained LLM backbone, and introduces a chunk-aware causal flow matching model that supports both streaming and non-streaming synthesis in one architecture. Trained on a large-scale multilingual dataset, the work claims human-parity naturalness, minimal response latency, and virtually lossless quality specifically in streaming mode.

Significance. If the performance claims are substantiated, the work would represent a practical advance in low-latency, high-fidelity streaming TTS for interactive multimodal LLM applications, particularly by unifying streaming and non-streaming capabilities and improving token efficiency through the listed optimizations.

major comments (2)
  1. [Abstract] The central claims of 'human-parity naturalness,' 'minimal response latency,' and 'virtually lossless synthesis quality' in streaming mode are asserted without any quantitative metrics, objective/subjective scores, baseline comparisons, ablation studies, or error analysis. This absence directly undermines evaluation of whether the finite-scalar quantization, streamlined LLM backbone, or chunk-aware causal flow matching produce the stated gains.
  2. [Architecture and Training sections] The descriptions of the three optimizations remain high-level narrative without equations, complexity analysis, or controlled experiments showing how each change (e.g., scalar quantization levels or chunk causality constraints) causally improves the reported metrics over the original CosyVoice.
minor comments (1)
  1. [Abstract] The demo link is useful; however, the text should clarify the exact definition of 'virtually lossless' (e.g., with respect to which reference signal or perceptual metric).

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our work. We address each of the major comments below and indicate the revisions we plan to make.

Point-by-point responses
  1. Referee: [Abstract] The central claims of 'human-parity naturalness,' 'minimal response latency,' and 'virtually lossless synthesis quality' in streaming mode are asserted without any quantitative metrics, objective/subjective scores, baseline comparisons, ablation studies, or error analysis. This absence directly undermines evaluation of whether the finite-scalar quantization, streamlined LLM backbone, or chunk-aware causal flow matching produce the stated gains.

    Authors: We acknowledge that the abstract does not contain specific numerical values; abstracts typically omit them to stay concise. The full paper contains extensive evaluation results in the Experiments section, including objective metrics like word error rate, speaker similarity scores, mean opinion scores (MOS) for naturalness, response latency measurements, comparisons against multiple baselines, and ablation studies on the individual components. To strengthen the abstract, we will add a few key quantitative results, such as the achieved MOS scores and latency values, to better substantiate the claims.
    Revision: yes

  2. Referee: [Architecture and Training sections] The descriptions of the three optimizations remain high-level narrative without equations, complexity analysis, or controlled experiments showing how each change (e.g., scalar quantization levels or chunk causality constraints) causally improves the reported metrics over the original CosyVoice.

    Authors: The current descriptions aim to provide an accessible overview. We agree that adding more technical depth would be beneficial. In the revision, we will introduce equations for finite-scalar quantization, including the specific quantization levels and how they enhance codebook utilization compared to the previous approach. For the streamlined LLM, we will include details on the architecture modifications, parameter counts, and a complexity analysis. For the chunk-aware causal flow matching, we will provide the formulation of the causal mechanism and chunk processing. Furthermore, we will enhance the ablation studies to more clearly demonstrate the contribution of each optimization through controlled comparisons to the original CosyVoice model, showing improvements in the relevant metrics.
    Revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

Full rationale

The paper describes an engineering progression from prior CosyVoice work through new optimizations (finite-scalar quantization, streamlined LLM backbone, chunk-aware causal flow matching) trained on large-scale multilingual data, with performance claims resting on empirical results rather than any closed-form derivation. No equations, fitted parameters renamed as predictions, or self-citation chains reduce the central claims to inputs by construction; the self-reference to previous work is purely contextual and not load-bearing for the reported human-parity or lossless outcomes.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The central claim rests on standard supervised learning assumptions plus several unstated modeling choices typical of LLM-based TTS; no new physical axioms or invented entities are introduced.

free parameters (2)
  • quantization codebook size and scalar levels
    Finite-scalar quantization parameters are chosen to improve utilization; exact values not given in abstract but required for the token efficiency claim.
  • chunk size and causality constraints
    Hyperparameters controlling streaming chunk length and causal masking are fitted or tuned to achieve low latency.
axioms (2)
  • domain assumption: Discrete speech tokens from supervised training capture sufficient prosody and content for high-quality synthesis
    Invoked when stating that progressive semantic decoding yields human-parity naturalness.
  • domain assumption: Pre-trained LLM weights transfer effectively to text-to-speech token prediction without major retraining
    Used when streamlining the model to directly employ a pre-trained LLM backbone.

pith-pipeline@v0.9.0 · 5607 in / 1461 out tokens · 61062 ms · 2026-05-13T06:09:01.262029+00:00 · methodology

Forward citations

Cited by 28 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. AffectCodec: Emotion-Preserving Neural Speech Codec for Expressive Speech Modeling

    cs.SD 2026-05 unverdicted novelty 7.0

    AffectCodec is an emotion-guided neural speech codec that preserves emotional cues during quantization while maintaining semantic fidelity and prosodic naturalness.

  2. How Should LLMs Listen While Speaking? A Study of User-Stream Routing in Full-Duplex Spoken Dialogue

    cs.CL 2026-05 unverdicted novelty 7.0

    Channel fusion gives better semantic grounding and QA performance in full-duplex LLM dialogue but is vulnerable to context corruption during interruptions, while cross-attention routing is more robust at the cost of w...

  3. Kinetic-Optimal Scheduling with Moment Correction for Metric-Induced Discrete Flow Matching in Zero-Shot Text-to-Speech

    eess.AS 2026-05 unverdicted novelty 7.0

    GibbsTTS combines a training-free kinetic-optimal scheduler with finite-step moment correction in MI-DFM to deliver top naturalness and strong speaker similarity in zero-shot TTS.

  4. VITA-QinYu: Expressive Spoken Language Model for Role-Playing and Singing

    cs.CL 2026-05 unverdicted novelty 7.0

    VITA-QinYu is the first expressive end-to-end spoken language model supporting role-playing and singing alongside conversation, trained on 15.8K hours of data and outperforming prior models on expressiveness and conve...

  5. Tibetan-TTS: Low-Resource Tibetan Speech Synthesis with Large Model Adaptation

    cs.SD 2026-05 unverdicted novelty 7.0

    Large-model adaptation with Tibetan text handling produces natural speech from limited data, outperforming commercial systems.

  6. MINT-Bench: A Comprehensive Multilingual Benchmark for Instruction-Following Text-to-Speech

    eess.AS 2026-04 unverdicted novelty 7.0

    MINT-Bench is a new benchmark using hierarchical taxonomy, multi-stage data pipeline, and hybrid evaluation to assess instruction-following TTS systems, revealing major gaps in compositional and paralinguistic controls.

  7. NVBench: A Benchmark for Speech Synthesis with Non-Verbal Vocalizations

    cs.SD 2026-04 unverdicted novelty 7.0

    NVBench provides a standardized bilingual benchmark and evaluation protocol for assessing non-verbal vocalization generation, placement, and salience in text-to-speech systems.

  8. AST: Adaptive, Seamless, and Training-Free Precise Speech Editing

    cs.SD 2026-04 unverdicted novelty 7.0

    AST enables seamless speech editing by latent recomposition on pre-trained TTS models plus adaptive weak fact guidance, plus a new dataset and WDTW metric, claiming 70% WER reduction and better temporal consistency wi...

  9. CoSyncDiT: Cognitive Synchronous Diffusion Transformer for Movie Dubbing

    cs.SD 2026-04 unverdicted novelty 7.0

    CoSyncDiT is a cognitive-inspired diffusion transformer that achieves state-of-the-art lip synchronization and naturalness in movie dubbing by guiding noise-to-speech generation through acoustic, visual, and contextua...

  10. CapTalk: Unified Voice Design for Single-Utterance and Dialogue Speech Generation

    cs.SD 2026-04 unverdicted novelty 7.0

    CapTalk unifies single-utterance and dialogue voice design via utterance- and speaker-level captions plus a hierarchical variational module for stable timbre with adaptive expression.

  11. The Silent Thought: Modeling Internal Cognition in Full-Duplex Spoken Dialogue Models via Latent Reasoning

    eess.AS 2026-03 unverdicted novelty 7.0

    FLAIR enables spoken dialogue AI to conduct continuous latent reasoning while perceiving speech through recursive latent embeddings and an ELBO-based finetuning objective.

  12. AuDirector: A Self-Reflective Closed-Loop Framework for Immersive Audio Storytelling

    cs.SD 2026-05 unverdicted novelty 6.0

    AuDirector is a self-reflective closed-loop multi-agent framework that generates immersive audio narratives with improved structural coherence, emotional expressiveness, and acoustic fidelity via identity-aware voice ...

  13. Reducing Linguistic Hallucination in LM-Based Speech Enhancement via Noise-Invariant Acoustic-Semantic Distillation

    eess.AS 2026-05 unverdicted novelty 6.0

    L3-SE reduces linguistic hallucination in LM-based speech enhancement by distilling noise-invariant acoustic-semantic representations from noisy inputs to condition an autoregressive decoder-only language model.

  14. CASCADE: Context-Aware Relaxation for Speculative Image Decoding

    cs.CV 2026-05 unverdicted novelty 6.0

    CASCADE formalizes semantic interchangeability and convergence in target model representations to enable context-aware acceptance relaxation in tree-based speculative decoding, delivering up to 3.6x speedup on text-to...

  15. TTS-PRISM: A Perceptual Reasoning and Interpretable Speech Model for Fine-Grained Diagnosis

    cs.CL 2026-04 unverdicted novelty 6.0

    TTS-PRISM defines a 12-dimensional perceptual schema, builds a targeted diagnostic dataset via adversarial synthesis and expert labels, and tunes an end-to-end model that outperforms generalist LLMs in human alignment...

  16. Why Your Tokenizer Fails in Information Fusion: A Timing-Aware Pre-Quantization Fusion for Video-Enhanced Audio Tokenization

    eess.AS 2026-04 unverdicted novelty 6.0

    A timing-aware pre-quantization fusion approach integrates visual cues into audio tokenizers along the temporal axis, maintaining reconstruction quality while outperforming audio-only and prior multimodal baselines on...

  17. Bridging What the Model Thinks and How It Speaks: Self-Aware Speech Language Models for Expressive Speech Generation

    cs.CL 2026-04 unverdicted novelty 6.0

    SA-SLM uses variational information bottleneck for intent-aware bridging and self-criticism for realization-aware alignment to close the semantic-acoustic gap, outperforming open-source models and nearing GPT-4o-Audio...

  18. ASPIRin: Action Space Projection for Interactivity-Optimized Reinforcement Learning in Full-Duplex Speech Language Models

    cs.CL 2026-04 unverdicted novelty 6.0

    ASPIRin decouples speaking timing from token content via binary action space projection and applies GRPO with rule-based rewards to optimize interactivity in SLMs without semantic collapse or repetition.

  19. OmniVoice: Towards Omnilingual Zero-Shot Text-to-Speech with Diffusion Language Models

    cs.CL 2026-04 unverdicted novelty 6.0

    OmniVoice introduces a diffusion language model-style non-autoregressive TTS system that directly maps text to multi-codebook acoustic tokens, scaling zero-shot synthesis to over 600 languages with SOTA results on mul...

  20. FoleyDirector: Fine-Grained Temporal Steering for Video-to-Audio Generation via Structured Scripts

    cs.SD 2026-03 unverdicted novelty 6.0

    FoleyDirector introduces structured temporal scripts and a fusion module to enable precise timing control in DiT-based video-to-audio generation while preserving audio fidelity.

  21. Borderless Long Speech Synthesis

    cs.SD 2026-03 unverdicted novelty 6.0

    Borderless Long Speech Synthesis unifies voice design, multi-speaker TTS, and long-form generation via Global-Sentence-Token annotations, CoT reasoning, and a Structured Semantic Interface for agent-centric control.

  22. Qwen3-Omni Technical Report

    cs.CL 2025-09 unverdicted novelty 6.0

    Qwen3-Omni is a unified multimodal model that achieves open-source SOTA on 32 of 36 audio and audio-visual benchmarks and overall SOTA on 22 without degrading performance on text, image, or video relative to single-mo...

  23. Sema: Semantic Transport for Real-Time Multimodal Agents

    cs.MM 2026-04 unverdicted novelty 5.0

    Sema reduces uplink bandwidth by 64x for audio and 130-210x for screenshots while keeping multimodal agent task accuracy within 0.7 percentage points of raw baselines in WAN simulations.

  24. Qwen3.5-Omni Technical Report

    cs.CL 2026-04 unverdicted novelty 5.0

    Qwen3.5-Omni scales an omnimodal model to hundreds of billions of parameters with 256k context, introduces ARIA for stable speech synthesis, and reports SOTA performance on 215 audio-visual benchmarks while adding mul...

  25. Controllable Singing Style Conversion with Boundary-Aware Information Bottleneck

    cs.SD 2026-04 unverdicted novelty 5.0

    A singing voice conversion system with boundary-aware information bottleneck and high-frequency augmentation achieves the best naturalness in SVCC2025 subjective tests while using less extra data than competitors.

  26. WAND: Windowed Attention and Knowledge Distillation for Efficient Autoregressive Text-to-Speech Models

    cs.CL 2026-03 unverdicted novelty 5.0

    WAND adapts AR-TTS models to constant complexity via windowed attention and distillation, cutting KV cache memory by up to 66.2% while preserving quality and achieving length-invariant latency.

  27. Qwen2.5-Omni Technical Report

    cs.CL 2025-03 conditional novelty 5.0

    Qwen2.5-Omni presents a multimodal model with block-wise encoders, TMRoPE position embeddings, and a Thinker-Talker architecture that enables simultaneous text and streaming speech generation while matching text perfo...

  28. Empowering Video Translation using Multimodal Large Language Models

    cs.CV 2026-04 unverdicted novelty 4.0

    The paper offers the first focused review of MLLM-based video translation organized by a three-role taxonomy of Semantic Reasoner, Expressive Performer, and Visual Synthesizer, plus open challenges.

Reference graph

Works this paper leans on

60 extracted references · 60 canonical work pages · cited by 28 Pith papers · 4 internal anchors

  1. [1]

    Yuxuan Wang, R. J. Skerry-Ryan, Daisy Stanton, Yonghui Wu, Ron J. Weiss, Navdeep Jaitly, Zongheng Yang, Ying Xiao, Zhifeng Chen, Samy Bengio, Quoc V . Le, Yannis Agiomyrgian- nakis, Rob Clark, and Rif A. Saurous. Tacotron: Towards end-to-end speech synthesis. In INTERSPEECH, pages 4006–4010. ISCA, 2017

  2. [2]

    Weiss, Mike Schuster, Navdeep Jaitly, Zongheng Yang, Zhifeng Chen, Yu Zhang, Yuxuan Wang, R

    Jonathan Shen, Ruoming Pang, Ron J. Weiss, Mike Schuster, Navdeep Jaitly, Zongheng Yang, Zhifeng Chen, Yu Zhang, Yuxuan Wang, R. J. Skerry-Ryan, Rif A. Saurous, Yannis Agiomyr- giannakis, and Yonghui Wu. Natural TTS synthesis by conditioning wavenet on MEL spectro- gram predictions. In ICASSP, pages 4779–4783. IEEE, 2018

  3. [3]

    Deep voice 3: 2000-speaker neural text-to-speech

    Wei Ping, Kainan Peng, Andrew Gibiansky, Sercan ¨Omer Arik, Ajay Kannan, Sharan Narang, Jonathan Raiman, and John Miller. Deep voice 3: 2000-speaker neural text-to-speech. CoRR, abs/1710.07654, 2017

  4. [4]

    Clarinet: Parallel wave generation in end-to-end text-to-speech

    Wei Ping, Kainan Peng, and Jitong Chen. Clarinet: Parallel wave generation in end-to-end text-to-speech. In ICLR (Poster). OpenReview.net, 2019

  5. [5]

    Fast- speech: Fast, robust and controllable text to speech

    Yi Ren, Yangjun Ruan, Xu Tan, Tao Qin, Sheng Zhao, Zhou Zhao, and Tie-Yan Liu. Fast- speech: Fast, robust and controllable text to speech. In NeurIPS, pages 3165–3174, 2019. 15

  6. [6]

    Neural speech synthesis with transformer network

    Naihan Li, Shujie Liu, Yanqing Liu, Sheng Zhao, and Ming Liu. Neural speech synthesis with transformer network. In AAAI, pages 6706–6713. AAAI Press, 2019

  7. [7]

    Fastspeech 2: Fast and high-quality end-to-end text to speech

    Yi Ren, Chenxu Hu, Xu Tan, Tao Qin, Sheng Zhao, Zhou Zhao, and Tie-Yan Liu. Fastspeech 2: Fast and high-quality end-to-end text to speech. In ICLR. OpenReview.net, 2021

  8. [8]

    Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers

    Chengyi Wang, Sanyuan Chen, Yu Wu, Ziqiang Zhang, Long Zhou, Shujie Liu, Zhuo Chen, Yanqing Liu, Huaming Wang, Jinyu Li, Lei He, Sheng Zhao, and Furu Wei. Neural codec language models are zero-shot text to speech synthesizers. CoRR, abs/2301.02111, 2023

  9. [9]

    Soundstream: An end-to-end neural audio codec

    Neil Zeghidour, Alejandro Luebs, Ahmed Omran, Jan Skoglund, and Marco Tagliasacchi. Soundstream: An end-to-end neural audio codec. IEEE ACM Trans. Audio Speech Lang. Process., 30:495–507, 2022

  10. [10]

    High fidelity neural audio compression

    Alexandre D ´efossez, Jade Copet, Gabriel Synnaeve, and Yossi Adi. High fidelity neural audio compression. Trans. Mach. Learn. Res., 2023, 2023

  11. [11]

    Funcodec: A fundamental, reproducible and integrable open-source toolkit for neural speech codec

    Zhihao Du, Shiliang Zhang, Kai Hu, and Siqi Zheng. Funcodec: A fundamental, reproducible and integrable open-source toolkit for neural speech codec. In ICASSP, pages 591–595. IEEE, 2024

  12. [12]

    Speak, read and prompt: High-fidelity text-to-speech with minimal supervision

    Eugene Kharitonov, Damien Vincent, Zal ´an Borsos, Rapha¨el Marinier, Sertan Girgin, Olivier Pietquin, Matt Sharifi, Marco Tagliasacchi, and Neil Zeghidour. Speak, read and prompt: High-fidelity text-to-speech with minimal supervision. Trans. Assoc. Comput. Linguistics , 11:1703–1718, 2023

  13. [13]

    ELLA-V: stable neural codec language modeling with alignment-guided sequence reordering.CoRR, abs/2401.07333, 2024

    Yakun Song, Zhuo Chen, Xiaofei Wang, Ziyang Ma, and Xie Chen. ELLA-V: stable neural codec language modeling with alignment-guided sequence reordering.CoRR, abs/2401.07333, 2024

  14. [14]

    V ALL-T: decoder-only generative transducer for robust and decoding- controllable text-to-speech

    Chenpeng Du, Yiwei Guo, Hankun Wang, Yifan Yang, Zhikang Niu, Shuai Wang, Hui Zhang, Xie Chen, and Kai Yu. V ALL-T: decoder-only generative transducer for robust and decoding- controllable text-to-speech. CoRR, abs/2401.14321, 2024

  15. [15]

    RALL-E: robust codec language modeling with chain-of-thought prompting for text-to-speech synthesis.CoRR, abs/2404.03204, 2024

    Detai Xin, Xu Tan, Kai Shen, Zeqian Ju, Dongchao Yang, Yuancheng Wang, Shinnosuke Takamichi, Hiroshi Saruwatari, Shujie Liu, Jinyu Li, and Sheng Zhao. RALL-E: robust codec language modeling with chain-of-thought prompting for text-to-speech synthesis.CoRR, abs/2404.03204, 2024

  16. [16]

    V ALL-E 2: Neural codec language models are human parity zero-shot text to speech synthesizers

    Sanyuan Chen, Shujie Liu, Long Zhou, Yanqing Liu, Xu Tan, Jinyu Li, Sheng Zhao, Yao Qian, and Furu Wei. V ALL-E 2: Neural codec language models are human parity zero-shot text to speech synthesizers. CoRR, abs/2406.05370, 2024

  17. [17]

    Vall-e r: Robust and efficient zero-shot text- to-speech synthesis via monotonic alignment.arXiv preprint arXiv:2406.07855, 2024

    Bing Han, Long Zhou, Shujie Liu, Sanyuan Chen, Lingwei Meng, Yanming Qian, Yanqing Liu, Sheng Zhao, Jinyu Li, and Furu Wei. V ALL-E R: robust and efficient zero-shot text-to- speech synthesis via monotonic alignment. CoRR, abs/2406.07855, 2024

  18. [18]

    Maskgct: Zero-shot text-to-speech with masked generative codec transformer

    Yuancheng Wang, Haoyue Zhan, Liwei Liu, Ruihong Zeng, Haotian Guo, Jiachen Zheng, Qiang Zhang, Shunsi Zhang, and Zhizheng Wu. Maskgct: Zero-shot text-to-speech with masked generative codec transformer. CoRR, abs/2409.00750, 2024

  19. [19]

    Wavenext: Convnext-based fast neural vocoder without ISTFT layer

    Takuma Okamoto, Haruki Yamashita, Yamato Ohtani, Tomoki Toda, and Hisashi Kawai. Wavenext: Convnext-based fast neural vocoder without ISTFT layer. In ASRU, pages 1–8. IEEE, 2023

  20. [20]

    V ocos: Closing the gap between time-domain and fourier-based neural vocoders for high-quality audio synthesis

    Hubert Siuzdak. V ocos: Closing the gap between time-domain and fourier-based neural vocoders for high-quality audio synthesis. In ICLR. OpenReview.net, 2024

  21. [21]

    Autoregressive speech synthesis without vector quantization

    Lingwei Meng, Long Zhou, Shujie Liu, Sanyuan Chen, Bing Han, Shujie Hu, Yanqing Liu, Jinyu Li, Sheng Zhao, Xixin Wu, Helen Meng, and Furu Wei. Autoregressive speech synthesis without vector quantization. CoRR, abs/2407.08551, 2024

  22. [22]

    Denoising diffusion probabilistic models

    Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. In Hugo Larochelle, Marc’Aurelio Ranzato, Raia Hadsell, Maria-Florina Balcan, and Hsuan-Tien Lin, editors, Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual , 2020

  23. [23]

    Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole

    Yang Song, Jascha Sohl-Dickstein, Diederik P. Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. In ICLR. OpenReview.net, 2021. 16

  24. [24]

    Yaron Lipman, Ricky T. Q. Chen, Heli Ben-Hamu, Maximilian Nickel, and Matthew Le. Flow matching for generative modeling. In ICLR. OpenReview.net, 2023

  25. [25]

    V oicebox: Text- guided multilingual universal speech generation at scale

    Matthew Le, Apoorv Vyas, Bowen Shi, Brian Karrer, Leda Sari, Rashel Moritz, Mary Williamson, Vimal Manohar, Yossi Adi, Jay Mahadeokar, and Wei-Ning Hsu. V oicebox: Text- guided multilingual universal speech generation at scale. In NeurIPS, 2023

  26. [26]

    Naturalspeech 3: Zero-shot speech synthesis with factorized codec and diffusion models

    Zeqian Ju, Yuancheng Wang, Kai Shen, Xu Tan, Detai Xin, Dongchao Yang, Eric Liu, Yichong Leng, Kaitao Song, Siliang Tang, Zhizheng Wu, Tao Qin, Xiangyang Li, Wei Ye, Shikun Zhang, Jiang Bian, Lei He, Jinyu Li, and Sheng Zhao. Naturalspeech 3: Zero-shot speech synthesis with factorized codec and diffusion models. In ICML. OpenReview.net, 2024

  27. [27]

    V oiceflow: Efficient text-to- speech with rectified flow matching

    Yiwei Guo, Chenpeng Du, Ziyang Ma, Xie Chen, and Kai Yu. V oiceflow: Efficient text-to- speech with rectified flow matching. In ICASSP, pages 11121–11125. IEEE, 2024

  28. [28]

    Matcha-tts: A fast TTS architecture with conditional flow matching

    Shivam Mehta, Ruibo Tu, Jonas Beskow, ´Eva Sz´ekely, and Gustav Eje Henter. Matcha-tts: A fast TTS architecture with conditional flow matching. In ICASSP, pages 11341–11345. IEEE, 2024

  29. [29]

    E3 TTS: easy end-to-end diffusion-based text to speech

    Yuan Gao, Nobuyuki Morioka, Yu Zhang, and Nanxin Chen. E3 TTS: easy end-to-end diffusion-based text to speech. In ASRU, pages 1–8. IEEE, 2023

  30. [30]

    Ditto-tts: Efficient and scalable zero-shot text-to-speech with diffusion transformer

    Keon Lee, Dong Won Kim, Jaehyeon Kim, and Jaewoong Cho. Ditto-tts: Efficient and scalable zero-shot text-to-speech with diffusion transformer. CoRR, abs/2406.11427, 2024

  31. [31]

    E2 TTS: embarrassingly easy fully non-autoregressive zero-shot TTS

    Sefik Emre Eskimez, Xiaofei Wang, Manthan Thakker, Canrun Li, Chung-Hsien Tsai, Zhen Xiao, Hemin Yang, Zirun Zhu, Min Tang, Xu Tan, Yanqing Liu, Sheng Zhao, and Naoyuki Kanda. E2 TTS: embarrassingly easy fully non-autoregressive zero-shot TTS. CoRR, abs/2406.18009, 2024

  32. [32]

    F5-TTS: A fairytaler that fakes fluent and faithful speech with flow matching,

    Yushen Chen, Zhikang Niu, Ziyang Ma, Keqi Deng, Chunhui Wang, Jian Zhao, Kai Yu, and Xie Chen. F5-TTS: A fairytaler that fakes fluent and faithful speech with flow matching. CoRR, abs/2410.06885, 2024

  33. [33]

    Seed-TTS: A Family of High-Quality Versatile Speech Generation Models

    Philip Anastassiou, Jiawei Chen, Jitong Chen, Yuanzhe Chen, Zhuo Chen, Ziyi Chen, Jian Cong, Lelai Deng, Chuang Ding, Lu Gao, Mingqing Gong, Peisong Huang, Qingqing Huang, Zhiying Huang, Yuanyuan Huo, Dongya Jia, Chumin Li, Feiya Li, Hui Li, Jiaxin Li, Xiaoyang Li, Xingxing Li, Lin Liu, Shouda Liu, Sichao Liu, Xudong Liu, Yuchen Liu, Zhengxi Liu, Lu Lu, J...

  34. [34]

    Cosyvoice: A scalable multilingual zero-shot text- to-speech synthesizer based on supervised semantic tokens

    Zhihao Du, Qian Chen, Shiliang Zhang, Kai Hu, Heng Lu, Yexin Yang, Hangrui Hu, Siqi Zheng, Yue Gu, Ziyang Ma, Zhifu Gao, and Zhijie Yan. Cosyvoice: A scalable multi- lingual zero-shot text-to-speech synthesizer based on supervised semantic tokens. CoRR, abs/2407.05407, 2024

  35. [35]

    Fireredtts: A foundation text-to-speech framework for industry-level generative speech applications.arXiv preprint arXiv:2409.03283, 2024

    Haohan Guo, Kun Liu, Feiyu Shen, Yi-Chen Wu, Feng-Long Xie, Kun Xie, and Kaituo Xu. Fireredtts: A foundation text-to-speech framework for industry-level generative speech appli- cations. CoRR, abs/2409.03283, 2024

  36. [36]

    GPT-4o System Card

    Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card. arXiv preprint arXiv:2410.21276, 2024

  37. [37]

    Speechgpt: Empowering large language models with intrinsic cross-modal conversational abilities

    Dong Zhang, Shimin Li, Xin Zhang, Jun Zhan, Pengyu Wang, Yaqian Zhou, and Xipeng Qiu. Speechgpt: Empowering large language models with intrinsic cross-modal conversational abilities. In EMNLP (Findings), pages 15757–15773. Association for Computational Linguistics, 2023

  38. [38]

    Livespeech: Low-latency zero-shot text-to-speech via autoregressive modeling of audio discrete codes

    Trung Dang, David Aponte, Dung N. Tran, and Kazuhito Koishida. Livespeech: Low-latency zero-shot text-to-speech via autoregressive modeling of audio discrete codes. CoRR, abs/2406.02897, 2024

  39. [39]

    Zero-shot text-to-speech from continuous text streams

    Trung Dang, David Aponte, Dung N. Tran, Tianyi Chen, and Kazuhito Koishida. Zero-shot text-to-speech from continuous text streams. CoRR, abs/2410.00767, 2024

  40. [40]

    BASE TTS: Lessons from building a billion-parameter text-to-speech model on 100k hours of data

    Mateusz Lajszczak, Guillermo Cámbara, Yang Li, Fatih Beyhan, Arent van Korlaar, Fan Yang, Arnaud Joly, Álvaro Martín-Cortinas, Ammar Abbas, Adam Michalski, Alexis Moinet, Sri Karlapati, Ewa Muszynska, Haohan Guo, Bartosz Putrycz, Soledad López Gambino, Kayeon Yoo, Elena Sokolova, and Thomas Drugman. BASE TTS: lessons from building a billion-paramet...

  41. [41]

    Speak while you think: Streaming speech synthesis during text generation

    Avihu Dekel, Slava Shechtman, Raul Fernandez, David Haws, Zvi Kons, and Ron Hoory. Speak while you think: Streaming speech synthesis during text generation. In ICASSP, pages 11931–11935. IEEE, 2024

  42. [42]

    Finite scalar quantization: VQ-VAE made simple

    Fabian Mentzer, David Minnen, Eirikur Agustsson, and Michael Tschannen. Finite scalar quantization: VQ-VAE made simple. In ICLR. OpenReview.net, 2024

  43. [43]

    Funaudiollm: Voice understanding and generation foundation models for natural interaction between humans and llms

    Tongyi Speech Team. Funaudiollm: Voice understanding and generation foundation models for natural interaction between humans and llms. arxiv, 2024

  44. [44]

    Jianlin Su, Murtadha H. M. Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding. Neurocomputing, 568:127063, 2024

  45. [45]

    Qwen2.5: A party of foundation models, September 2024

    Qwen Team. Qwen2.5: A party of foundation models, September 2024

  46. [46]

    Classifier-Free Diffusion Guidance

    Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598, 2022

  47. [47]

    Improved denoising diffusion probabilistic models

    Alexander Quinn Nichol and Prafulla Dhariwal. Improved denoising diffusion probabilistic models. In International Conference on Machine Learning, pages 8162–8171. PMLR, 2021

  48. [48]

    Voicebox: Text-guided multilingual universal speech generation at scale

    Matthew Le, Apoorv Vyas, Bowen Shi, Brian Karrer, Leda Sari, Rashel Moritz, Mary Williamson, Vimal Manohar, Yossi Adi, Jay Mahadeokar, et al. Voicebox: Text-guided multilingual universal speech generation at scale. Advances in neural information processing systems, 36, 2024

  49. [49]

    Direct preference optimization: Your language model is secretly a reward model

    Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D. Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. In NeurIPS, 2023

  50. [50]

    Paraformer: Fast and accurate parallel transformer for non-autoregressive end-to-end speech recognition

    Zhifu Gao, Shiliang Zhang, Ian McLoughlin, and Zhijie Yan. Paraformer: Fast and accurate parallel transformer for non-autoregressive end-to-end speech recognition. In Interspeech, pages 2063–2067. ISCA, 2022

  51. [51]

    Unicats: A unified context-aware text-to-speech framework with contextual vq-diffusion and vocoding

    Chenpeng Du, Yiwei Guo, Feiyu Shen, Zhijun Liu, Zheng Liang, Xie Chen, Shuai Wang, Hui Zhang, and Kai Yu. Unicats: A unified context-aware text-to-speech framework with contextual vq-diffusion and vocoding. In AAAI, pages 17924–17932. AAAI Press, 2024

  52. [52]

    An enhanced res2net with local and global feature fusion for speaker verification

    Yafeng Chen, Siqi Zheng, Hui Wang, Luyao Cheng, Qian Chen, and Jiajun Qi. An enhanced res2net with local and global feature fusion for speaker verification. In Interspeech. ISCA, 2023

  53. [53]

    Chandan K. A. Reddy, Vishak Gopal, and Ross Cutler. Dnsmos P.835: A non-intrusive perceptual objective speech quality metric to evaluate noise suppressors. In ICASSP, pages 886–890. IEEE, 2022

  54. [54]

    Robust speech recognition via large-scale weak supervision

    Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. Robust speech recognition via large-scale weak supervision. In Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett, editors, International Conference on Machine Learning, ICML 2023, 23-29 July 2023, Honolulu, Haw...

  55. [55]

    A large-scale evaluation of speech foundation models

    Shu-wen Yang, Heng-Jui Chang, Zili Huang, Andy T Liu, Cheng-I Lai, Haibin Wu, Jiatong Shi, Xuankai Chang, Hsiang-Sheng Tsai, Wen-Chin Huang, et al. A large-scale evaluation of speech foundation models. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2024

  56. [56]

    2noise. Chattts. https://github.com/2noise/ChatTTS, 2024

  57. [57]

    Gpt-sovits

    RVC-Boss. Gpt-sovits. https://github.com/RVC-Boss/GPT-SoVITS, 2024

  58. [58]

    Openvoice: Versatile instant voice cloning

    Zengyi Qin, Wenliang Zhao, Xumin Yu, and Xin Sun. Openvoice: Versatile instant voice cloning. CoRR, abs/2312.01479, 2023

  59. [59]

    Natural language guidance of high-fidelity text-to-speech with synthetic annotations

    Daniel Lyth and Simon King. Natural language guidance of high-fidelity text-to-speech with synthetic annotations. CoRR, abs/2402.01912, 2024

  60. [60]

    Emotivoice

    Netease Youdao. Emotivoice. https://github.com/netease-youdao/EmotiVoice, 2024