pith. machine review for the scientific record.

arxiv: 2505.17589 · v2 · submitted 2025-05-23 · 💻 cs.SD · cs.AI · eess.AS

Recognition: 3 Lean theorem links

CosyVoice 3: Towards In-the-wild Speech Generation via Scaling-up and Post-training

Authors on Pith: no claims yet

Pith reviewed 2026-05-16 05:22 UTC · model grok-4.3

classification 💻 cs.SD · cs.AI · eess.AS
keywords speech synthesis · zero-shot · multilingual · scaling · speech tokenizer · reward model · post-training · prosody

The pith

CosyVoice 3 improves zero-shot multilingual speech synthesis by scaling training data to one million hours and model size to 1.5 billion parameters, and by adding a multi-task speech tokenizer and a differentiable reward model.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces CosyVoice 3 to address limitations in language coverage, domain diversity, and post-training from its predecessor CosyVoice 2. It shows that expanding training data across nine languages and eighteen Chinese dialects, increasing model capacity, and using a new tokenizer trained on multiple supervised tasks produce higher content consistency, speaker similarity, and prosody naturalness in real-world conditions. A differentiable reward model is added for post-training and can also be applied to other language-model-based speech systems. These changes target the challenges of generating speech from varied text formats and acoustic environments without prior speaker or style references.

Core claim

CosyVoice 3 surpasses its predecessor in content consistency, speaker similarity, and prosody naturalness for zero-shot multilingual speech synthesis in the wild, through dataset scaling to one million hours, model scaling to 1.5 billion parameters, a supervised multi-task speech tokenizer covering automatic speech recognition, emotion recognition, language identification, audio event detection, and speaker analysis, together with a new differentiable reward model for post-training.

What carries the argument

The supervised multi-task speech tokenizer trained jointly on automatic speech recognition, speech emotion recognition, language identification, audio event detection, and speaker analysis, which supplies richer conditioning signals for prosody and consistency during generation.
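
The review only names the tokenizer's five supervised tasks, so the following is a minimal sketch, assuming a PyTorch-style implementation, of how one shared speech encoder with a discrete codebook bottleneck could be trained jointly on those tasks. Every module, dimension, and head here is a hypothetical placeholder chosen for illustration, not the CosyVoice 3 architecture.

```python
# Minimal sketch (PyTorch, hypothetical shapes and heads): a shared encoder with a
# quantization bottleneck and one lightweight head per supervised task. Illustrative
# of the multi-task idea only; not the CosyVoice 3 tokenizer.
import torch
import torch.nn as nn

class MultiTaskTokenizer(nn.Module):
    def __init__(self, n_mels=80, d_model=512, n_tokens=4096,
                 n_phones=500, n_emotions=8, n_langs=9, n_events=50, n_speakers=10000):
        super().__init__()
        self.frontend = nn.Conv1d(n_mels, d_model, kernel_size=3, stride=2, padding=1)
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=6)
        self.codebook = nn.Embedding(n_tokens, d_model)      # discrete speech tokens
        self.heads = nn.ModuleDict({                          # one head per task
            "asr": nn.Linear(d_model, n_phones),               # frame-level phones
            "ser": nn.Linear(d_model, n_emotions),             # emotion recognition
            "lid": nn.Linear(d_model, n_langs),                # language identification
            "aed": nn.Linear(d_model, n_events),               # audio event detection
            "spk": nn.Linear(d_model, n_speakers),             # speaker analysis
        })

    def quantize(self, h):
        # Nearest codebook entry with a straight-through gradient estimator.
        cb = self.codebook.weight.unsqueeze(0).expand(h.size(0), -1, -1)
        idx = torch.cdist(h, cb).argmin(-1)                    # (B, T)
        q = self.codebook(idx)
        return h + (q - h).detach(), idx

    def forward(self, mels):                                   # mels: (B, n_mels, T)
        h = self.encoder(self.frontend(mels).transpose(1, 2))  # (B, T', d_model)
        q, tokens = self.quantize(h)
        pooled = q.mean(dim=1)                                 # utterance-level summary
        logits = {name: head(q if name == "asr" else pooled)
                  for name, head in self.heads.items()}
        return tokens, logits

# The joint objective would be a weighted sum of per-task losses, e.g.
#   loss = w_asr * ctc(logits["asr"], ...) + w_ser * ce(logits["ser"], ...) + ...
model = MultiTaskTokenizer()
tokens, logits = model(torch.randn(2, 80, 200))                # two fake utterances
print(tokens.shape, logits["ser"].shape)                       # torch.Size([2, 100]) torch.Size([2, 8])
```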

If this is right

  • The differentiable reward model can be reused to post-train other LLM-based speech synthesis systems (a hedged sketch of this pattern follows the list).
  • Larger data and model scales support synthesis across more domains and text formats while maintaining low-latency streaming.
  • The multi-task tokenizer enables better handling of prosody variation in zero-shot scenarios without explicit style references.
  • Performance gains appear on benchmarks covering nine languages and eighteen dialects under diverse acoustic conditions.
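
On the reward-model bullet above: calling the reward model differentiable suggests its score is backpropagated into the synthesizer directly rather than used only as a scalar reinforcement signal. The sketch below shows that generic pattern under assumed interfaces; generate_differentiable and the reward model's call signature are invented placeholders, not the paper's API.

```python
# Generic sketch (PyTorch) of post-training against a frozen, differentiable reward
# model. All interfaces here are assumptions for illustration, not CosyVoice 3's API.
import torch

def reward_post_train_step(tts_model, reward_model, optimizer, texts, prompt_speech):
    """One gradient step that pushes the synthesizer toward higher reward."""
    # The synthesis path must stay differentiable end to end (e.g. soft token
    # posteriors or a flow-matching decoder output) for the reward gradient to
    # reach the synthesizer's parameters.
    generated = tts_model.generate_differentiable(texts, prompt_speech)

    reward_model.requires_grad_(False)                       # reward model stays frozen
    reward = reward_model(generated, texts, prompt_speech)   # per-utterance scores, (B,)

    loss = -reward.mean()                                    # gradient ascent on reward
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return float(loss)
```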

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The post-training reward approach may transfer to other audio generation tasks such as music or sound effects.
  • Further scaling beyond one million hours could continue to improve robustness if compute allows without introducing new failure modes.
  • The tokenizer's multi-task design suggests similar joint training could benefit related tasks like speech enhancement or diarization.

Load-bearing premise

That scaling data volume and model size together with the new tokenizer and reward model will deliver the reported gains in consistency and naturalness without overfitting or reduced generalization across unseen real-world conditions.

What would settle it

A controlled evaluation on a held-out set of multilingual wild recordings showing no statistically significant gains or outright drops in content consistency, speaker similarity, or prosody naturalness scores compared to CosyVoice 2.
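
One standard way to operationalize "statistically significant" here is to score both systems per utterance on the same held-out prompts (cosine similarity of speaker embeddings for speaker similarity, WER of an ASR transcript for content consistency, listener ratings for naturalness) and run a paired bootstrap on the per-utterance differences. The sketch below, on placeholder data, illustrates that recipe; it is not the paper's evaluation protocol.

```python
# Illustrative recipe on placeholder data: per-utterance metrics for each system,
# then a paired bootstrap on the differences. Not the paper's evaluation protocol.
import numpy as np

def speaker_similarity(emb_ref, emb_gen):
    """Cosine similarity between reference and generated speaker embeddings."""
    emb_ref, emb_gen = np.asarray(emb_ref), np.asarray(emb_gen)
    return float(emb_ref @ emb_gen /
                 (np.linalg.norm(emb_ref) * np.linalg.norm(emb_gen)))

def paired_bootstrap(scores_a, scores_b, n_resamples=10000, seed=0):
    """Mean per-utterance gain of system B over A, with a 95% bootstrap interval."""
    rng = np.random.default_rng(seed)
    diffs = np.asarray(scores_b) - np.asarray(scores_a)
    idx = rng.integers(0, len(diffs), size=(n_resamples, len(diffs)))
    boot_means = diffs[idx].mean(axis=1)
    return diffs.mean(), np.percentile(boot_means, [2.5, 97.5])

rng = np.random.default_rng(1)
print("example SIM:", round(speaker_similarity(rng.normal(size=192), rng.normal(size=192)), 3))

# Hypothetical per-utterance speaker-similarity scores on a shared in-the-wild set.
sim_v2 = rng.uniform(0.60, 0.80, size=200)      # stand-in for CosyVoice 2
sim_v3 = rng.uniform(0.62, 0.82, size=200)      # stand-in for CosyVoice 3
gain, ci = paired_bootstrap(sim_v2, sim_v3)
print(f"mean gain {gain:+.3f}, 95% CI [{ci[0]:+.3f}, {ci[1]:+.3f}]")
# The same test applies to per-utterance WER deltas or listener ratings; an interval
# that includes zero would mean the claimed gain is not supported on that test set.
```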

read the original abstract

In our prior works, we introduced a scalable streaming speech synthesis model, CosyVoice 2, which integrates a large language model (LLM) and a chunk-aware flow matching (FM) model, and achieves low-latency bi-streaming speech synthesis and human-parity quality. Despite these advancements, CosyVoice 2 exhibits limitations in language coverage, domain diversity, data volume, text formats, and post-training techniques. In this paper, we present CosyVoice 3, an improved model designed for zero-shot multilingual speech synthesis in the wild, surpassing its predecessor in content consistency, speaker similarity, and prosody naturalness. Key features of CosyVoice 3 include: 1) A novel speech tokenizer to improve prosody naturalness, developed via supervised multi-task training, including automatic speech recognition, speech emotion recognition, language identification, audio event detection, and speaker analysis. 2) A new differentiable reward model for post-training applicable not only to CosyVoice 3 but also to other LLM-based speech synthesis models. 3) Dataset Size Scaling: Training data is expanded from ten thousand hours to one million hours, encompassing 9 languages and 18 Chinese dialects across various domains and text formats. 4) Model Size Scaling: Model parameters are increased from 0.5 billion to 1.5 billion, resulting in enhanced performance on our multilingual benchmark due to the larger model capacity. These advancements contribute significantly to the progress of speech synthesis in the wild. We encourage readers to listen to the demo at https://funaudiollm.github.io/cosyvoice3.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper presents CosyVoice 3 as an advancement over CosyVoice 2 for zero-shot multilingual speech synthesis in the wild. It claims superior content consistency, speaker similarity, and prosody naturalness through four main contributions: a novel multi-task speech tokenizer trained on ASR, SER, LID, AED, and speaker analysis tasks; a differentiable reward model for post-training; scaling training data from 10k to 1M hours across 9 languages and 18 Chinese dialects; and scaling model parameters from 0.5B to 1.5B. The work emphasizes applicability to in-the-wild conditions and provides a demo link for subjective evaluation.

Significance. If the claimed gains are confirmed through rigorous, quantitative benchmarks with ablations, this work would meaningfully advance scalable LLM-based speech synthesis by demonstrating the benefits of combined data/model scaling and targeted post-training components. The introduction of a reusable differentiable reward model and the multi-task tokenizer represent potentially reusable contributions for the field.

major comments (2)
  1. [Abstract] The central claim that CosyVoice 3 surpasses CosyVoice 2 in content consistency, speaker similarity, and prosody naturalness is stated without any quantitative metrics, baseline comparisons, error bars, or ablation results. This absence makes it impossible to assess whether the reported improvements are load-bearing or statistically meaningful, directly undermining evaluation of the scaling and post-training contributions.
  2. [Methods and Experiments] The manuscript describes the multi-task tokenizer and differentiable reward model as key innovations but provides no details on how these components were validated against overfitting risks when scaling to 1M hours and 1.5B parameters. Specific ablation tables isolating the contribution of each (e.g., tokenizer vs. reward model vs. scale) are required to support the weakest assumption that the combination yields generalization gains in diverse real-world conditions.
minor comments (2)
  1. [Abstract] The abstract and introduction would benefit from explicit cross-references to the specific benchmark datasets and evaluation protocols used for the multilingual results.
  2. [Demo and Figures] Figure captions and demo descriptions should clarify which audio samples correspond to zero-shot vs. few-shot conditions to aid reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment point-by-point below. Where the comments identify gaps in quantitative support and validation details, we have prepared revisions to strengthen the manuscript.

read point-by-point responses
  1. Referee: [Abstract] The central claim that CosyVoice 3 surpasses CosyVoice 2 in content consistency, speaker similarity, and prosody naturalness is stated without any quantitative metrics, baseline comparisons, error bars, or ablation results. This absence makes it impossible to assess whether the reported improvements are load-bearing or statistically meaningful, directly undermining evaluation of the scaling and post-training contributions.

    Authors: We agree that the abstract would be strengthened by including key quantitative results. In the revised manuscript we will update the abstract to report specific metrics, including WER reductions for content consistency, speaker embedding cosine similarity scores, and MOS improvements for prosody naturalness, with direct comparisons to CosyVoice 2. We will also reference the corresponding tables and note any available error bars from repeated evaluations. revision: yes

  2. Referee: [Methods and Experiments] The manuscript describes the multi-task tokenizer and differentiable reward model as key innovations but provides no details on how these components were validated against overfitting risks when scaling to 1M hours and 1.5B parameters. Specific ablation tables isolating the contribution of each (e.g., tokenizer vs. reward model vs. scale) are required to support the weakest assumption that the combination yields generalization gains in diverse real-world conditions.

    Authors: We acknowledge the need for explicit validation details and finer-grained ablations. The current manuscript contains initial ablation results in the Experiments section, but to address overfitting concerns at scale we will add a new paragraph in Methods describing our procedures: use of a large held-out in-the-wild validation set, monitoring of training versus validation loss curves, and regularization techniques applied during the 1M-hour training. We will also expand the ablation tables to isolate the individual contributions of the multi-task tokenizer, differentiable reward model, data scaling, and model scaling, reporting results across the nine languages and 18 dialects under diverse real-world conditions. revision: yes

Circularity Check

1 step flagged

Minor self-citation to CosyVoice 2 provides context but does not reduce central claims to inputs

specific steps
  1. self-citation · load-bearing [Abstract]
    "In our prior works, we introduced a scalable streaming speech synthesis model, CosyVoice 2, which integrates a large language model (LLM) and a chunk-aware flow matching (FM) model, and achieves low-latency bi-streaming speech synthesis and human-parity quality. Despite these advancements, CosyVoice 2 exhibits limitations in language coverage, domain diversity, data volume, text formats, and post-training techniques. In this paper, we present CosyVoice 3, an improved model designed for zero-shot multilingual speech synthesis in the wild, surpassing its predecessor in content consistency, 1) A "

    The central positioning of CosyVoice 3 as surpassing CosyVoice 2 relies on a self-citation to prior work by overlapping authors, but the paper then introduces independent new elements (multi-task tokenizer, differentiable reward model, explicit scaling) whose performance gains are not derived from the cited model by definition or fit. The citation supplies only historical context rather than a load-bearing uniqueness theorem or ansatz that collapses the new results.

full rationale

The paper's claims rest on explicit new components (multi-task tokenizer via supervised training on ASR/SER/LID/AED/speaker tasks, differentiable reward model for post-training, data scaling to 1M hours across 9 languages/18 dialects, model scaling to 1.5B parameters) whose effects are described as independent increments over the prior CosyVoice 2 architecture. No equation, prediction, or uniqueness result is shown to reduce by construction to a fitted parameter or to the self-cited predecessor. The self-citation is limited to background and does not serve as the sole justification for the reported gains in consistency, similarity, or naturalness; external benchmarks and listening demos are referenced as validation. This yields a low but non-zero circularity score for the normal incremental self-reference.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 2 invented entities

The central claim rests on the assumption that scaling data and model size plus the new tokenizer and reward model will yield the stated gains; these are treated as domain assumptions rather than derived results.

free parameters (2)
  • model parameter count = 1.5 billion
    Increased from 0.5 billion to 1.5 billion to enhance capacity
  • training data volume = one million hours
    Expanded from ten thousand hours to one million hours
axioms (2)
  • domain assumption Scaling laws observed in language models also apply to speech synthesis models
    Invoked to justify performance gains from larger data and model size (a standard power-law form of this assumption is sketched after this ledger)
  • ad hoc to paper Multi-task supervised training on ASR, SER, LID, AED and speaker analysis produces a tokenizer that improves prosody naturalness
    Core justification for the novel tokenizer component
invented entities (2)
  • differentiable reward model no independent evidence
    purpose: Post-training refinement for LLM-based speech synthesis models
    New component introduced to improve output quality and claimed to generalize to other models
  • novel speech tokenizer no independent evidence
    purpose: Improve prosody naturalness via multi-task training
    Developed specifically through supervised training on multiple tasks
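
On the ledger's scaling-law axiom: for language models that assumption is usually written as a power-law parametric form for held-out loss in model size and data volume. The sketch below evaluates such a form with hypothetical constants for the two regimes the paper describes; whether speech-token prediction obeys this form is exactly the assumption being flagged, not an established result.

```python
# Illustrative only: the "scaling laws" assumption is usually written as a
# power-law form for held-out loss,  L(N, D) ~ E + A / N**alpha + B / D**beta,
# where N is parameter count and D is data volume. The constants below are
# hypothetical placeholders, not fitted values from the paper.
def scaling_law(N, D, E=1.2, A=400.0, alpha=0.30, B=20.0, beta=0.28):
    return E + A / N**alpha + B / D**beta

configs = {
    "CosyVoice 2 regime (0.5B params, 10k h)": (0.5e9, 1e4),
    "CosyVoice 3 regime (1.5B params, 1M h)":  (1.5e9, 1e6),
}
for name, (N, D) in configs.items():
    # Under these invented constants the larger regime predicts a lower loss;
    # the axiom is that speech synthesis actually follows such a curve.
    print(f"{name}: predicted loss = {scaling_law(N, D):.2f}")
```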

pith-pipeline@v0.9.0 · 5677 in / 1592 out tokens · 49339 ms · 2026-05-16T05:22:37.340405+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • Cost.FunctionalEquation washburn_uniqueness_aczel · unclear

    Relation between the paper passage and the cited Recognition theorem.

    Key features of CosyVoice 3 include: 1) A novel speech tokenizer... 2) A new differentiable reward model... 3) Dataset Size Scaling: Training data is expanded from ten thousand hours to one million hours... 4) Model Size Scaling: Model parameters are increased from 0.5 billion to 1.5 billion

  • PhiForcing phi_equation · unclear

    Relation between the paper passage and the cited Recognition theorem.

    We present CosyVoice 3, an improved model designed for zero-shot multilingual speech synthesis in the wild, surpassing its predecessor in content consistency, speaker similarity, and prosody naturalness

  • DimensionForcing alexander_duality_circle_linking · unclear

    Relation between the paper passage and the cited Recognition theorem.

    Through optimizing semantic token utilization, initializing with text-based LLMs, designing a bidirectional streaming scheme...

What do these tags mean?
matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 18 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. VoxSafeBench: Not Just What Is Said, but Who, How, and Where

    cs.SD 2026-04 unverdicted novelty 8.0

    VoxSafeBench reveals that speech language models recognize social norms from text but fail to apply them when acoustic cues like speaker or scene determine the appropriate response.

  2. Kinetic-Optimal Scheduling with Moment Correction for Metric-Induced Discrete Flow Matching in Zero-Shot Text-to-Speech

    eess.AS 2026-05 unverdicted novelty 7.0

    GibbsTTS combines a training-free kinetic-optimal scheduler with finite-step moment correction in MI-DFM to deliver top naturalness and strong speaker similarity in zero-shot TTS.

  3. VITA-QinYu: Expressive Spoken Language Model for Role-Playing and Singing

    cs.CL 2026-05 unverdicted novelty 7.0

    VITA-QinYu is the first expressive end-to-end spoken language model supporting role-playing and singing alongside conversation, trained on 15.8K hours of data and outperforming prior models on expressiveness and conve...

  4. Toward Fine-Grained Speech Inpainting Forensics:A Dataset, Method, and Metric for Multi-Region Tampering Localization

    cs.SD 2026-05 unverdicted novelty 7.0

    A new dataset, iterative coarse-to-fine localization framework, and segment-level IoU F1 metric tackle the open problem of detecting multiple unknown word-level inpainted regions in speech.

  5. SPG-Codec: Exploring the Role and Boundaries of Semantic Priors in Ultra-Low-Bitrate Neural Speech Coding

    eess.AS 2026-04 unverdicted novelty 7.0

    Semantic priors from HuBERT and Whisper improve speech codec intelligibility up to 6 kbps but show diminishing returns beyond that, with a bitrate-aware regulation strategy balancing semantic consistency and naturalness.

  6. AST: Adaptive, Seamless, and Training-Free Precise Speech Editing

    cs.SD 2026-04 unverdicted novelty 7.0

    AST enables seamless speech editing by latent recomposition on pre-trained TTS models plus adaptive weak fact guidance, plus a new dataset and WDTW metric, claiming 70% WER reduction and better temporal consistency wi...

  7. From Reactive to Proactive: Assessing the Proactivity of Voice Agents via ProVoice-Bench

    cs.AI 2026-04 unverdicted novelty 7.0

    ProVoice-Bench is the first framework to evaluate proactive voice agents, revealing that state-of-the-art multimodal LLMs struggle with over-triggering and context-aware reasoning.

  8. Knowing What to Stress: A Discourse-Conditioned Text-to-Speech Benchmark

    cs.CL 2026-04 unverdicted novelty 7.0

    CAST benchmark shows language models infer correct word stress from discourse context but TTS systems frequently fail to produce it in speech.

  9. SQuTR: A Robustness Benchmark for Spoken Query to Text Retrieval under Acoustic Noise

    cs.IR 2026-02 unverdicted novelty 7.0

    SQuTR aggregates 37k queries from six text retrieval datasets, synthesizes speech from 200 speakers, adds 17 noise categories at varying SNR, and shows that even large retrieval models degrade sharply under extreme ac...

  10. TTS-PRISM: A Perceptual Reasoning and Interpretable Speech Model for Fine-Grained Diagnosis

    cs.CL 2026-04 unverdicted novelty 6.0

    TTS-PRISM defines a 12-dimensional perceptual schema, builds a targeted diagnostic dataset via adversarial synthesis and expert labels, and tunes an end-to-end model that outperforms generalist LLMs in human alignment...

  11. UAF: A Unified Audio Front-end LLM for Full-Duplex Speech Interaction

    cs.AI 2026-04 unverdicted novelty 6.0

    UAF is the first unified audio front-end LLM that turns multiple front-end tasks into one sequence prediction model processing streaming audio chunks and reference prompts to output semantic and control tokens for ful...

  12. MoVE: Translating Laughter and Tears via Mixture of Vocalization Experts in Speech-to-Speech Translation

    cs.CL 2026-04 unverdicted novelty 6.0

    MoVE uses specialized LoRA expert adapters and a soft router to translate non-verbal vocalizations in S2ST, reproducing them in 76% of cases versus at most 14% for baselines while scoring highest on naturalness and em...

  13. Audio2Tool: Speak, Call, Act -- A Dataset for Benchmarking Speech Tool Use

    cs.SD 2026-04 unverdicted novelty 6.0

    Audio2Tool is a new benchmark dataset that shows speech models perform well on simple commands but degrade sharply on compositional tasks and realistic acoustic noise.

  14. ASPIRin: Action Space Projection for Interactivity-Optimized Reinforcement Learning in Full-Duplex Speech Language Models

    cs.CL 2026-04 unverdicted novelty 6.0

    ASPIRin decouples speaking timing from token content via binary action space projection and applies GRPO with rule-based rewards to optimize interactivity in SLMs without semantic collapse or repetition.

  15. OmniVoice: Towards Omnilingual Zero-Shot Text-to-Speech with Diffusion Language Models

    cs.CL 2026-04 unverdicted novelty 6.0

    OmniVoice introduces a diffusion language model-style non-autoregressive TTS system that directly maps text to multi-codebook acoustic tokens, scaling zero-shot synthesis to over 600 languages with SOTA results on mul...

  16. FoleyDirector: Fine-Grained Temporal Steering for Video-to-Audio Generation via Structured Scripts

    cs.SD 2026-03 unverdicted novelty 6.0

    FoleyDirector introduces structured temporal scripts and a fusion module to enable precise timing control in DiT-based video-to-audio generation while preserving audio fidelity.

  17. Qwen3-Omni Technical Report

    cs.CL 2025-09 unverdicted novelty 6.0

    Qwen3-Omni is a unified multimodal model that achieves open-source SOTA on 32 of 36 audio and audio-visual benchmarks and overall SOTA on 22 without degrading performance on text, image, or video relative to single-mo...

  18. RADAR Challenge 2026: Robust Audio Deepfake Recognition under Media Transformations

    eess.AS 2026-05 unverdicted novelty 5.0

    The RADAR Challenge 2026 provides a multilingual benchmark for audio deepfake detection under media transformations and finds that robust performance remains an open problem.

Reference graph

Works this paper leans on

60 extracted references · 60 canonical work pages · cited by 18 Pith papers · 8 internal anchors
