pith. machine review for the scientific record.

arxiv: 2505.17589 · v2 · submitted 2025-05-23 · 💻 cs.SD · cs.AI · eess.AS

Recognition: 3 Lean theorem links

CosyVoice 3: Towards In-the-wild Speech Generation via Scaling-up and Post-training

Authors on Pith: no claims yet

Pith reviewed 2026-05-16 05:22 UTC · model grok-4.3

classification 💻 cs.SD · cs.AI · eess.AS
keywords speech synthesis · zero-shot · multilingual · scaling · speech tokenizer · reward model · post-training · prosody

The pith

CosyVoice 3 improves zero-shot multilingual speech synthesis by scaling training data to one million hours and model size to 1.5 billion parameters, and by adding a multi-task speech tokenizer and a differentiable reward model.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces CosyVoice 3 to address limitations in language coverage, domain diversity, and post-training from its predecessor CosyVoice 2. It shows that expanding training data across nine languages and eighteen Chinese dialects, increasing model capacity, and using a new tokenizer trained on multiple supervised tasks produce higher content consistency, speaker similarity, and prosody naturalness in real-world conditions. A differentiable reward model is added for post-training and can also be applied to other language-model-based speech systems. These changes target the challenges of generating speech from varied text formats and acoustic environments without prior speaker or style references.

Core claim

CosyVoice 3 surpasses its predecessor in content consistency, speaker similarity, and prosody naturalness for zero-shot multilingual speech synthesis in the wild, through dataset scaling to one million hours, model scaling to 1.5 billion parameters, a supervised multi-task speech tokenizer covering automatic speech recognition, emotion recognition, language identification, audio event detection, and speaker analysis, together with a new differentiable reward model for post-training.

What carries the argument

The supervised multi-task speech tokenizer trained jointly on automatic speech recognition, speech emotion recognition, language identification, audio event detection, and speaker analysis, which supplies richer conditioning signals for prosody and consistency during generation.
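
The review only names the tokenizer's five supervised tasks, so the following is a minimal sketch, assuming a PyTorch-style implementation, of how one shared speech encoder with a discrete codebook bottleneck could be trained jointly on those tasks. Every module, dimension, and head here is a hypothetical placeholder chosen for illustration, not the CosyVoice 3 architecture.

```python
# Minimal sketch (PyTorch, hypothetical shapes and heads): a shared encoder with a
# quantization bottleneck and one lightweight head per supervised task. Illustrative
# of the multi-task idea only; not the CosyVoice 3 tokenizer.
import torch
import torch.nn as nn

class MultiTaskTokenizer(nn.Module):
    def __init__(self, n_mels=80, d_model=512, n_tokens=4096,
                 n_phones=500, n_emotions=8, n_langs=9, n_events=50, n_speakers=10000):
        super().__init__()
        self.frontend = nn.Conv1d(n_mels, d_model, kernel_size=3, stride=2, padding=1)
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=6)
        self.codebook = nn.Embedding(n_tokens, d_model)      # discrete speech tokens
        self.heads = nn.ModuleDict({                          # one head per task
            "asr": nn.Linear(d_model, n_phones),               # frame-level phones
            "ser": nn.Linear(d_model, n_emotions),             # emotion recognition
            "lid": nn.Linear(d_model, n_langs),                # language identification
            "aed": nn.Linear(d_model, n_events),               # audio event detection
            "spk": nn.Linear(d_model, n_speakers),             # speaker analysis
        })

    def quantize(self, h):
        # Nearest codebook entry with a straight-through gradient estimator.
        cb = self.codebook.weight.unsqueeze(0).expand(h.size(0), -1, -1)
        idx = torch.cdist(h, cb).argmin(-1)                    # (B, T)
        q = self.codebook(idx)
        return h + (q - h).detach(), idx

    def forward(self, mels):                                   # mels: (B, n_mels, T)
        h = self.encoder(self.frontend(mels).transpose(1, 2))  # (B, T', d_model)
        q, tokens = self.quantize(h)
        pooled = q.mean(dim=1)                                 # utterance-level summary
        logits = {name: head(q if name == "asr" else pooled)
                  for name, head in self.heads.items()}
        return tokens, logits

# The joint objective would be a weighted sum of per-task losses, e.g.
#   loss = w_asr * ctc(logits["asr"], ...) + w_ser * ce(logits["ser"], ...) + ...
model = MultiTaskTokenizer()
tokens, logits = model(torch.randn(2, 80, 200))                # two fake utterances
print(tokens.shape, logits["ser"].shape)                       # torch.Size([2, 100]) torch.Size([2, 8])
```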

If this is right

  • The differentiable reward model can be reused to post-train other LLM-based speech synthesis systems (a hedged sketch of this pattern follows the list).
  • Larger data and model scales support synthesis across more domains and text formats while maintaining low-latency streaming.
  • The multi-task tokenizer enables better handling of prosody variation in zero-shot scenarios without explicit style references.
  • Performance gains appear on benchmarks covering nine languages and eighteen dialects under diverse acoustic conditions.
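
On the reward-model bullet above: calling the reward model differentiable suggests its score is backpropagated into the synthesizer directly rather than used only as a scalar reinforcement signal. The sketch below shows that generic pattern under assumed interfaces; generate_differentiable and the reward model's call signature are invented placeholders, not the paper's API.

```python
# Generic sketch (PyTorch) of post-training against a frozen, differentiable reward
# model. All interfaces here are assumptions for illustration, not CosyVoice 3's API.
import torch

def reward_post_train_step(tts_model, reward_model, optimizer, texts, prompt_speech):
    """One gradient step that pushes the synthesizer toward higher reward."""
    # The synthesis path must stay differentiable end to end (e.g. soft token
    # posteriors or a flow-matching decoder output) for the reward gradient to
    # reach the synthesizer's parameters.
    generated = tts_model.generate_differentiable(texts, prompt_speech)

    reward_model.requires_grad_(False)                       # reward model stays frozen
    reward = reward_model(generated, texts, prompt_speech)   # per-utterance scores, (B,)

    loss = -reward.mean()                                    # gradient ascent on reward
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return float(loss)
```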

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The post-training reward approach may transfer to other audio generation tasks such as music or sound effects.
  • Further scaling beyond one million hours could continue to improve robustness if compute allows without introducing new failure modes.
  • The tokenizer's multi-task design suggests similar joint training could benefit related tasks like speech enhancement or diarization.

Load-bearing premise

That scaling data volume and model size together with the new tokenizer and reward model will deliver the reported gains in consistency and naturalness without overfitting or reduced generalization across unseen real-world conditions.

What would settle it

A controlled evaluation on a held-out set of multilingual wild recordings showing no statistically significant gains or outright drops in content consistency, speaker similarity, or prosody naturalness scores compared to CosyVoice 2.
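
One standard way to operationalize "statistically significant" here is to score both systems per utterance on the same held-out prompts (cosine similarity of speaker embeddings for speaker similarity, WER of an ASR transcript for content consistency, listener ratings for naturalness) and run a paired bootstrap on the per-utterance differences. The sketch below, on placeholder data, illustrates that recipe; it is not the paper's evaluation protocol.

```python
# Illustrative recipe on placeholder data: per-utterance metrics for each system,
# then a paired bootstrap on the differences. Not the paper's evaluation protocol.
import numpy as np

def speaker_similarity(emb_ref, emb_gen):
    """Cosine similarity between reference and generated speaker embeddings."""
    emb_ref, emb_gen = np.asarray(emb_ref), np.asarray(emb_gen)
    return float(emb_ref @ emb_gen /
                 (np.linalg.norm(emb_ref) * np.linalg.norm(emb_gen)))

def paired_bootstrap(scores_a, scores_b, n_resamples=10000, seed=0):
    """Mean per-utterance gain of system B over A, with a 95% bootstrap interval."""
    rng = np.random.default_rng(seed)
    diffs = np.asarray(scores_b) - np.asarray(scores_a)
    idx = rng.integers(0, len(diffs), size=(n_resamples, len(diffs)))
    boot_means = diffs[idx].mean(axis=1)
    return diffs.mean(), np.percentile(boot_means, [2.5, 97.5])

rng = np.random.default_rng(1)
print("example SIM:", round(speaker_similarity(rng.normal(size=192), rng.normal(size=192)), 3))

# Hypothetical per-utterance speaker-similarity scores on a shared in-the-wild set.
sim_v2 = rng.uniform(0.60, 0.80, size=200)      # stand-in for CosyVoice 2
sim_v3 = rng.uniform(0.62, 0.82, size=200)      # stand-in for CosyVoice 3
gain, ci = paired_bootstrap(sim_v2, sim_v3)
print(f"mean gain {gain:+.3f}, 95% CI [{ci[0]:+.3f}, {ci[1]:+.3f}]")
# The same test applies to per-utterance WER deltas or listener ratings; an interval
# that includes zero would mean the claimed gain is not supported on that test set.
```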

read the original abstract

In our prior works, we introduced a scalable streaming speech synthesis model, CosyVoice 2, which integrates a large language model (LLM) and a chunk-aware flow matching (FM) model, and achieves low-latency bi-streaming speech synthesis and human-parity quality. Despite these advancements, CosyVoice 2 exhibits limitations in language coverage, domain diversity, data volume, text formats, and post-training techniques. In this paper, we present CosyVoice 3, an improved model designed for zero-shot multilingual speech synthesis in the wild, surpassing its predecessor in content consistency, speaker similarity, and prosody naturalness. Key features of CosyVoice 3 include: 1) A novel speech tokenizer to improve prosody naturalness, developed via supervised multi-task training, including automatic speech recognition, speech emotion recognition, language identification, audio event detection, and speaker analysis. 2) A new differentiable reward model for post-training applicable not only to CosyVoice 3 but also to other LLM-based speech synthesis models. 3) Dataset Size Scaling: Training data is expanded from ten thousand hours to one million hours, encompassing 9 languages and 18 Chinese dialects across various domains and text formats. 4) Model Size Scaling: Model parameters are increased from 0.5 billion to 1.5 billion, resulting in enhanced performance on our multilingual benchmark due to the larger model capacity. These advancements contribute significantly to the progress of speech synthesis in the wild. We encourage readers to listen to the demo at https://funaudiollm.github.io/cosyvoice3.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper presents CosyVoice 3 as an advancement over CosyVoice 2 for zero-shot multilingual speech synthesis in the wild. It claims superior content consistency, speaker similarity, and prosody naturalness through four main contributions: a novel multi-task speech tokenizer trained on ASR, SER, LID, AED, and speaker analysis tasks; a differentiable reward model for post-training; scaling training data from 10k to 1M hours across 9 languages and 18 Chinese dialects; and scaling model parameters from 0.5B to 1.5B. The work emphasizes applicability to in-the-wild conditions and provides a demo link for subjective evaluation.

Significance. If the claimed gains are confirmed through rigorous, quantitative benchmarks with ablations, this work would meaningfully advance scalable LLM-based speech synthesis by demonstrating the benefits of combined data/model scaling and targeted post-training components. The introduction of a reusable differentiable reward model and the multi-task tokenizer represent potentially reusable contributions for the field.

major comments (2)
  1. [Abstract] The central claim that CosyVoice 3 surpasses CosyVoice 2 in content consistency, speaker similarity, and prosody naturalness is stated without any quantitative metrics, baseline comparisons, error bars, or ablation results. This absence makes it impossible to assess whether the reported improvements are load-bearing or statistically meaningful, directly undermining evaluation of the scaling and post-training contributions.
  2. [Methods and Experiments] The manuscript describes the multi-task tokenizer and differentiable reward model as key innovations but provides no details on how these components were validated against overfitting risks when scaling to 1M hours and 1.5B parameters. Specific ablation tables isolating the contribution of each (e.g., tokenizer vs. reward model vs. scale) are required to support the weakest assumption that the combination yields generalization gains in diverse real-world conditions.
minor comments (2)
  1. [Abstract] The abstract and introduction would benefit from explicit cross-references to the specific benchmark datasets and evaluation protocols used for the multilingual results.
  2. [Demo and Figures] Figure captions and demo descriptions should clarify which audio samples correspond to zero-shot vs. few-shot conditions to aid reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment point-by-point below. Where the comments identify gaps in quantitative support and validation details, we have prepared revisions to strengthen the manuscript.

read point-by-point responses
  1. Referee: [Abstract] The central claim that CosyVoice 3 surpasses CosyVoice 2 in content consistency, speaker similarity, and prosody naturalness is stated without any quantitative metrics, baseline comparisons, error bars, or ablation results. This absence makes it impossible to assess whether the reported improvements are load-bearing or statistically meaningful, directly undermining evaluation of the scaling and post-training contributions.

    Authors: We agree that the abstract would be strengthened by including key quantitative results. In the revised manuscript we will update the abstract to report specific metrics, including WER reductions for content consistency, speaker embedding cosine similarity scores, and MOS improvements for prosody naturalness, with direct comparisons to CosyVoice 2. We will also reference the corresponding tables and note any available error bars from repeated evaluations. revision: yes

  2. Referee: [Methods and Experiments] The manuscript describes the multi-task tokenizer and differentiable reward model as key innovations but provides no details on how these components were validated against overfitting risks when scaling to 1M hours and 1.5B parameters. Specific ablation tables isolating the contribution of each (e.g., tokenizer vs. reward model vs. scale) are required to support the weakest assumption that the combination yields generalization gains in diverse real-world conditions.

    Authors: We acknowledge the need for explicit validation details and finer-grained ablations. The current manuscript contains initial ablation results in the Experiments section, but to address overfitting concerns at scale we will add a new paragraph in Methods describing our procedures: use of a large held-out in-the-wild validation set, monitoring of training versus validation loss curves, and regularization techniques applied during the 1M-hour training. We will also expand the ablation tables to isolate the individual contributions of the multi-task tokenizer, differentiable reward model, data scaling, and model scaling, reporting results across the nine languages and 18 dialects under diverse real-world conditions. revision: yes

Circularity Check

1 step flagged

Minor self-citation to CosyVoice 2 provides context but does not reduce central claims to inputs

specific steps
  1. self-citation · load-bearing [Abstract]
    "In our prior works, we introduced a scalable streaming speech synthesis model, CosyVoice 2, which integrates a large language model (LLM) and a chunk-aware flow matching (FM) model, and achieves low-latency bi-streaming speech synthesis and human-parity quality. Despite these advancements, CosyVoice 2 exhibits limitations in language coverage, domain diversity, data volume, text formats, and post-training techniques. In this paper, we present CosyVoice 3, an improved model designed for zero-shot multilingual speech synthesis in the wild, surpassing its predecessor in content consistency, 1) A "

    The central positioning of CosyVoice 3 as surpassing CosyVoice 2 relies on a self-citation to prior work by overlapping authors, but the paper then introduces independent new elements (multi-task tokenizer, differentiable reward model, explicit scaling) whose performance gains are not derived from the cited model by definition or fit. The citation supplies only historical context rather than a load-bearing uniqueness theorem or ansatz that collapses the new results.

full rationale

The paper's claims rest on explicit new components (multi-task tokenizer via supervised training on ASR/SER/LID/AED/speaker tasks, differentiable reward model for post-training, data scaling to 1M hours across 9 languages/18 dialects, model scaling to 1.5B parameters) whose effects are described as independent increments over the prior CosyVoice 2 architecture. No equation, prediction, or uniqueness result is shown to reduce by construction to a fitted parameter or to the self-cited predecessor. The self-citation is limited to background and does not serve as the sole justification for the reported gains in consistency, similarity, or naturalness; external benchmarks and listening demos are referenced as validation. This yields a low but non-zero circularity score for the normal incremental self-reference.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 2 invented entities

The central claim rests on the assumption that scaling data and model size plus the new tokenizer and reward model will yield the stated gains; these are treated as domain assumptions rather than derived results.

free parameters (2)
  • model parameter count = 1.5 billion
    Increased from 0.5 billion to 1.5 billion to enhance capacity
  • training data volume = one million hours
    Expanded from ten thousand hours to one million hours
axioms (2)
  • domain assumption Scaling laws observed in language models also apply to speech synthesis models
    Invoked to justify performance gains from larger data and model size (a standard power-law form of this assumption is sketched after this ledger)
  • ad hoc to paper Multi-task supervised training on ASR, SER, LID, AED and speaker analysis produces a tokenizer that improves prosody naturalness
    Core justification for the novel tokenizer component
invented entities (2)
  • differentiable reward model no independent evidence
    purpose: Post-training refinement for LLM-based speech synthesis models
    New component introduced to improve output quality and claimed to generalize to other models
  • novel speech tokenizer no independent evidence
    purpose: Improve prosody naturalness via multi-task training
    Developed specifically through supervised training on multiple tasks
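
On the ledger's scaling-law axiom: for language models that assumption is usually written as a power-law parametric form for held-out loss in model size and data volume. The sketch below evaluates such a form with hypothetical constants for the two regimes the paper describes; whether speech-token prediction obeys this form is exactly the assumption being flagged, not an established result.

```python
# Illustrative only: the "scaling laws" assumption is usually written as a
# power-law form for held-out loss,  L(N, D) ~ E + A / N**alpha + B / D**beta,
# where N is parameter count and D is data volume. The constants below are
# hypothetical placeholders, not fitted values from the paper.
def scaling_law(N, D, E=1.2, A=400.0, alpha=0.30, B=20.0, beta=0.28):
    return E + A / N**alpha + B / D**beta

configs = {
    "CosyVoice 2 regime (0.5B params, 10k h)": (0.5e9, 1e4),
    "CosyVoice 3 regime (1.5B params, 1M h)":  (1.5e9, 1e6),
}
for name, (N, D) in configs.items():
    # Under these invented constants the larger regime predicts a lower loss;
    # the axiom is that speech synthesis actually follows such a curve.
    print(f"{name}: predicted loss = {scaling_law(N, D):.2f}")
```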

pith-pipeline@v0.9.0 · 5677 in / 1592 out tokens · 49339 ms · 2026-05-16T05:22:37.340405+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • Cost.FunctionalEquation washburn_uniqueness_aczel · unclear

    Relation between the paper passage and the cited Recognition theorem.

    Key features of CosyVoice 3 include: 1) A novel speech tokenizer... 2) A new differentiable reward model... 3) Dataset Size Scaling: Training data is expanded from ten thousand hours to one million hours... 4) Model Size Scaling: Model parameters are increased from 0.5 billion to 1.5 billion

  • PhiForcing phi_equation · unclear

    Relation between the paper passage and the cited Recognition theorem.

    We present CosyVoice 3, an improved model designed for zero-shot multilingual speech synthesis in the wild, surpassing its predecessor in content consistency, speaker similarity, and prosody naturalness

  • DimensionForcing alexander_duality_circle_linking · unclear

    Relation between the paper passage and the cited Recognition theorem.

    Through optimizing semantic token utilization, initializing with text-based LLMs, designing a bidirectional streaming scheme...

What do these tags mean?
matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 18 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. VoxSafeBench: Not Just What Is Said, but Who, How, and Where

    cs.SD 2026-04 unverdicted novelty 8.0

    VoxSafeBench reveals that speech language models recognize social norms from text but fail to apply them when acoustic cues like speaker or scene determine the appropriate response.

  2. Kinetic-Optimal Scheduling with Moment Correction for Metric-Induced Discrete Flow Matching in Zero-Shot Text-to-Speech

    eess.AS 2026-05 unverdicted novelty 7.0

    GibbsTTS combines a training-free kinetic-optimal scheduler with finite-step moment correction in MI-DFM to deliver top naturalness and strong speaker similarity in zero-shot TTS.

  3. VITA-QinYu: Expressive Spoken Language Model for Role-Playing and Singing

    cs.CL 2026-05 unverdicted novelty 7.0

    VITA-QinYu is the first expressive end-to-end spoken language model supporting role-playing and singing alongside conversation, trained on 15.8K hours of data and outperforming prior models on expressiveness and conve...

  4. Toward Fine-Grained Speech Inpainting Forensics:A Dataset, Method, and Metric for Multi-Region Tampering Localization

    cs.SD 2026-05 unverdicted novelty 7.0

    A new dataset, iterative coarse-to-fine localization framework, and segment-level IoU F1 metric tackle the open problem of detecting multiple unknown word-level inpainted regions in speech.

  5. SPG-Codec: Exploring the Role and Boundaries of Semantic Priors in Ultra-Low-Bitrate Neural Speech Coding

    eess.AS 2026-04 unverdicted novelty 7.0

    Semantic priors from HuBERT and Whisper improve speech codec intelligibility up to 6 kbps but show diminishing returns beyond that, with a bitrate-aware regulation strategy balancing semantic consistency and naturalness.

  6. AST: Adaptive, Seamless, and Training-Free Precise Speech Editing

    cs.SD 2026-04 unverdicted novelty 7.0

    AST enables seamless speech editing by latent recomposition on pre-trained TTS models plus adaptive weak fact guidance, plus a new dataset and WDTW metric, claiming 70% WER reduction and better temporal consistency wi...

  7. From Reactive to Proactive: Assessing the Proactivity of Voice Agents via ProVoice-Bench

    cs.AI 2026-04 unverdicted novelty 7.0

    ProVoice-Bench is the first framework to evaluate proactive voice agents, revealing that state-of-the-art multimodal LLMs struggle with over-triggering and context-aware reasoning.

  8. Knowing What to Stress: A Discourse-Conditioned Text-to-Speech Benchmark

    cs.CL 2026-04 unverdicted novelty 7.0

    CAST benchmark shows language models infer correct word stress from discourse context but TTS systems frequently fail to produce it in speech.

  9. SQuTR: A Robustness Benchmark for Spoken Query to Text Retrieval under Acoustic Noise

    cs.IR 2026-02 unverdicted novelty 7.0

    SQuTR aggregates 37k queries from six text retrieval datasets, synthesizes speech from 200 speakers, adds 17 noise categories at varying SNR, and shows that even large retrieval models degrade sharply under extreme ac...

  10. TTS-PRISM: A Perceptual Reasoning and Interpretable Speech Model for Fine-Grained Diagnosis

    cs.CL 2026-04 unverdicted novelty 6.0

    TTS-PRISM defines a 12-dimensional perceptual schema, builds a targeted diagnostic dataset via adversarial synthesis and expert labels, and tunes an end-to-end model that outperforms generalist LLMs in human alignment...

  11. UAF: A Unified Audio Front-end LLM for Full-Duplex Speech Interaction

    cs.AI 2026-04 unverdicted novelty 6.0

    UAF is the first unified audio front-end LLM that turns multiple front-end tasks into one sequence prediction model processing streaming audio chunks and reference prompts to output semantic and control tokens for ful...

  12. MoVE: Translating Laughter and Tears via Mixture of Vocalization Experts in Speech-to-Speech Translation

    cs.CL 2026-04 unverdicted novelty 6.0

    MoVE uses specialized LoRA expert adapters and a soft router to translate non-verbal vocalizations in S2ST, reproducing them in 76% of cases versus at most 14% for baselines while scoring highest on naturalness and em...

  13. Audio2Tool: Speak, Call, Act -- A Dataset for Benchmarking Speech Tool Use

    cs.SD 2026-04 unverdicted novelty 6.0

    Audio2Tool is a new benchmark dataset that shows speech models perform well on simple commands but degrade sharply on compositional tasks and realistic acoustic noise.

  14. ASPIRin: Action Space Projection for Interactivity-Optimized Reinforcement Learning in Full-Duplex Speech Language Models

    cs.CL 2026-04 unverdicted novelty 6.0

    ASPIRin decouples speaking timing from token content via binary action space projection and applies GRPO with rule-based rewards to optimize interactivity in SLMs without semantic collapse or repetition.

  15. OmniVoice: Towards Omnilingual Zero-Shot Text-to-Speech with Diffusion Language Models

    cs.CL 2026-04 unverdicted novelty 6.0

    OmniVoice introduces a diffusion language model-style non-autoregressive TTS system that directly maps text to multi-codebook acoustic tokens, scaling zero-shot synthesis to over 600 languages with SOTA results on mul...

  16. FoleyDirector: Fine-Grained Temporal Steering for Video-to-Audio Generation via Structured Scripts

    cs.SD 2026-03 unverdicted novelty 6.0

    FoleyDirector introduces structured temporal scripts and a fusion module to enable precise timing control in DiT-based video-to-audio generation while preserving audio fidelity.

  17. Qwen3-Omni Technical Report

    cs.CL 2025-09 unverdicted novelty 6.0

    Qwen3-Omni is a unified multimodal model that achieves open-source SOTA on 32 of 36 audio and audio-visual benchmarks and overall SOTA on 22 without degrading performance on text, image, or video relative to single-mo...

  18. RADAR Challenge 2026: Robust Audio Deepfake Recognition under Media Transformations

    eess.AS 2026-05 unverdicted novelty 5.0

    The RADAR Challenge 2026 provides a multilingual benchmark for audio deepfake detection under media transformations and finds that robust performance remains an open problem.

Reference graph

Works this paper leans on

60 extracted references · 60 canonical work pages · cited by 18 Pith papers · 8 internal anchors
