CosyVoice 2: Scalable Streaming Speech Synthesis with Large Language Models
Pith reviewed 2026-05-13 06:09 UTC · model grok-4.3
The pith
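CosyVoice 2 reports human-parity naturalness and minimal response latency in streaming speech synthesis, attributed to three optimizations: finite-scalar quantization, a pre-trained LLM backbone, and chunk-aware causal flow matching.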
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
CosyVoice 2 incorporates finite-scalar quantization to improve codebook utilization of speech tokens, streamlines the text-speech language model to allow direct use of a pre-trained LLM as backbone, and develops a chunk-aware causal flow matching model to support streaming and non-streaming synthesis within a single model. By training on a large-scale multilingual dataset, CosyVoice 2 achieves human-parity naturalness, minimal response latency, and virtually lossless synthesis quality in the streaming mode.
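The codebook-utilization point can be illustrated with a minimal sketch of finite-scalar quantization in its generic form (bound each latent dimension, then round it to a small fixed set of integers). This is not the paper's implementation: the function names, the restriction to odd level counts, and the example values are assumptions made for illustration only.

```python
import math


def fsq_quantize(z, levels):
    """Round each latent dimension to one of levels[i] integer values.

    Assumption for this sketch: every entry of `levels` is odd, so the
    allowed values are symmetric around zero, e.g. {-1, 0, 1} for 3 levels.
    """
    codes = []
    for value, n_levels in zip(z, levels):
        half = (n_levels - 1) // 2
        # tanh bounds the value to (-1, 1); scaling maps it to (-half, half).
        codes.append(round(math.tanh(value) * half))
    return codes


def fsq_index(codes, levels):
    """Map a vector of per-dimension codes to a single token id (mixed radix)."""
    index, basis = 0, 1
    for code, n_levels in zip(codes, levels):
        half = (n_levels - 1) // 2
        index += (code + half) * basis  # shift the code into 0..n_levels-1
        basis *= n_levels
    return index


levels = [3, 3, 3]                     # hypothetical: 3 dimensions, 3 levels each
codes = fsq_quantize([0.0, 5.0, -5.0], levels)
token_id = fsq_index(codes, levels)    # one id out of 3 * 3 * 3 = 27 possible
```

Because every combination of per-dimension values is a valid code, the codebook is implicit and there are no learned entries that can go unused, which is the usual argument for better utilization compared with a learned vector-quantization codebook.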
What carries the argument
A chunk-aware causal flow matching model, which processes audio in chunks to enable streaming while preserving quality, together with finite-scalar quantization, which improves the codebook utilization of speech tokens.
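One common way to make a single model serve both streaming and non-streaming use is to restrict attention with a chunk-level mask. The sketch below shows that general idea only; the function name and the masking rule are assumptions, and the paper's actual masking scheme may differ.

```python
def chunk_causal_mask(n_frames, chunk_size):
    """Return mask[i][j] == True when frame i may attend to frame j.

    Hypothetical rule for this sketch: a frame sees every frame in its own
    chunk and in all earlier chunks, but nothing from later chunks.
    """
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    return [
        [(j // chunk_size) <= (i // chunk_size) for j in range(n_frames)]
        for i in range(n_frames)
    ]


streaming = chunk_causal_mask(4, 2)      # two chunks of two frames each
fully_causal = chunk_causal_mask(4, 1)   # chunk of one frame: strictly causal
offline = chunk_causal_mask(4, 4)        # one chunk covers everything: non-causal
```

Under this rule the chunk size is the only knob: a size of one frame gives a strictly causal model, and a size at least as large as the sequence gives ordinary full attention, so the same weights can be run in either mode.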
Load-bearing premise
The listed changes in quantization, architecture streamlining, and causal flow matching are what produce the human-parity naturalness and lossless streaming results.
What would settle it
A controlled listening test in which raters score CosyVoice 2 streaming outputs against matched human recordings on naturalness and intelligibility, with average scores falling measurably below human parity.
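A minimal sketch of how such a listening test could be scored, assuming each rater scores a matched human recording and a streaming output for the same sentence. The paired bootstrap, the function name, and the ratings below are illustrative assumptions, not the paper's evaluation protocol.

```python
import random
from statistics import mean


def paired_bootstrap_ci(human, synth, n_resamples=2000, alpha=0.05, seed=0):
    """Bootstrap confidence interval for mean(synth - human) on paired ratings."""
    if len(human) != len(synth):
        raise ValueError("ratings must be paired one-to-one")
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    diffs = [s - h for h, s in zip(human, synth)]
    means = []
    for _ in range(n_resamples):
        resample = [rng.choice(diffs) for _ in diffs]
        means.append(mean(resample))
    means.sort()
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Hypothetical 1-5 naturalness ratings, one pair per test sentence.
human_scores = [4, 5, 4, 5, 4, 4, 5, 4]
synth_scores = [4, 4, 4, 5, 3, 4, 5, 4]
lower, upper = paired_bootstrap_ci(human_scores, synth_scores)
below_parity = upper < 0  # whole interval below zero: measurably below human
```

The decision rule mirrors the statement above: the parity claim would be challenged only if the entire interval for the synthetic-minus-human difference lies below zero.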
Original abstract
In our previous work, we introduced CosyVoice, a multilingual speech synthesis model based on supervised discrete speech tokens. By employing progressive semantic decoding with two popular generative models, language models (LMs) and Flow Matching, CosyVoice demonstrated high prosody naturalness, content consistency, and speaker similarity in speech in-context learning. Recently, significant progress has been made in multi-modal large language models (LLMs), where the response latency and real-time factor of speech synthesis play a crucial role in the interactive experience. Therefore, in this report, we present an improved streaming speech synthesis model, CosyVoice 2, which incorporates comprehensive and systematic optimizations. Specifically, we introduce finite-scalar quantization to improve the codebook utilization of speech tokens. For the text-speech LM, we streamline the model architecture to allow direct use of a pre-trained LLM as the backbone. In addition, we develop a chunk-aware causal flow matching model to support various synthesis scenarios, enabling both streaming and non-streaming synthesis within a single model. By training on a large-scale multilingual dataset, CosyVoice 2 achieves human-parity naturalness, minimal response latency, and virtually lossless synthesis quality in the streaming mode. We invite readers to listen to the demos at https://funaudiollm.github.io/cosyvoice2.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents CosyVoice 2, an improved version of the prior CosyVoice model for multilingual speech synthesis. It incorporates finite-scalar quantization to enhance speech token codebook utilization, streamlines the text-speech language model to directly leverage a pre-trained LLM backbone, and introduces a chunk-aware causal flow matching model that supports both streaming and non-streaming synthesis in one architecture. Trained on a large-scale multilingual dataset, the work claims human-parity naturalness, minimal response latency, and virtually lossless quality specifically in streaming mode.
Significance. If the performance claims are substantiated, the work would represent a practical advance in low-latency, high-fidelity streaming TTS for interactive multimodal LLM applications, particularly by unifying streaming and non-streaming capabilities and improving token efficiency through the listed optimizations.
major comments (2)
- [Abstract] The central claims of 'human-parity naturalness,' 'minimal response latency,' and 'virtually lossless synthesis quality' in streaming mode are asserted without any quantitative metrics, objective/subjective scores, baseline comparisons, ablation studies, or error analysis. This absence directly undermines evaluation of whether the finite-scalar quantization, streamlined LLM backbone, or chunk-aware causal flow matching produce the stated gains.
- [Architecture and Training sections] The descriptions of the three optimizations remain high-level narrative without equations, complexity analysis, or controlled experiments showing how each change (e.g., scalar quantization levels or chunk causality constraints) causally improves the reported metrics over the original CosyVoice.
minor comments (1)
- [Abstract] The demo link is useful; however, the text should clarify the exact definition of 'virtually lossless' (e.g., with respect to which reference signal or perceptual metric).
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our work. We address each of the major comments below and indicate the revisions we plan to make.
Point-by-point responses
- Referee: [Abstract] The central claims of 'human-parity naturalness,' 'minimal response latency,' and 'virtually lossless synthesis quality' in streaming mode are asserted without any quantitative metrics, objective/subjective scores, baseline comparisons, ablation studies, or error analysis. This absence directly undermines evaluation of whether the finite-scalar quantization, streamlined LLM backbone, or chunk-aware causal flow matching produce the stated gains.
Authors: We acknowledge that the abstract contains no specific numerical values; abstracts typically omit them to stay concise. The Experiments section of the full paper reports extensive evaluation results, including objective metrics such as word error rate and speaker similarity, mean opinion scores (MOS) for naturalness, response latency measurements, comparisons against multiple baselines, and ablation studies on the individual components. To strengthen the abstract, we will add a few key quantitative results, such as the MOS and latency values, to better substantiate the claims.
Revision: yes
- Referee: [Architecture and Training sections] The descriptions of the three optimizations remain high-level narrative without equations, complexity analysis, or controlled experiments showing how each change (e.g., scalar quantization levels or chunk causality constraints) causally improves the reported metrics over the original CosyVoice.
Authors: The current descriptions aim to provide an accessible overview. We agree that adding more technical depth would be beneficial. In the revision, we will introduce equations for finite-scalar quantization, including the specific quantization levels and how they enhance codebook utilization compared to the previous approach. For the streamlined LLM, we will include details on the architecture modifications, parameter counts, and a complexity analysis. For the chunk-aware causal flow matching, we will provide the formulation of the causal mechanism and chunk processing. Furthermore, we will enhance the ablation studies to more clearly demonstrate the contribution of each optimization through controlled comparisons to the original CosyVoice model, showing improvements in the relevant metrics.
Revision: yes
Circularity Check
No significant circularity in derivation chain
Full rationale
The paper describes an engineering progression from prior CosyVoice work through new optimizations (finite-scalar quantization, streamlined LLM backbone, chunk-aware causal flow matching) trained on large-scale multilingual data, with performance claims resting on empirical results rather than any closed-form derivation. No equations, fitted parameters renamed as predictions, or self-citation chains reduce the central claims to inputs by construction; the self-reference to previous work is purely contextual and not load-bearing for the reported human-parity or lossless outcomes.
Axiom & Free-Parameter Ledger
free parameters (2)
- quantization codebook size and scalar levels
- chunk size and causality constraints
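To make the two free parameters concrete, the sketch below shows how they would turn into measurable quantities: the implicit codebook size that follows from the scalar levels, and the amount of audio covered by one chunk. The function names and all numeric values are illustrative assumptions and are not taken from this page.

```python
from math import prod


def codebook_size(levels):
    """Implicit codebook size of a scalar quantizer: product of per-dimension levels."""
    return prod(levels)


def chunk_duration_ms(chunk_tokens, tokens_per_second):
    """Audio covered by one chunk of speech tokens, in milliseconds."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return 1000.0 * chunk_tokens / tokens_per_second


# Illustrative settings only: 8 dimensions with 3 levels each, and chunks of
# 15 tokens at an assumed rate of 25 tokens per second.
size = codebook_size([3] * 8)
duration = chunk_duration_ms(15, 25)
```

Smaller chunks reduce how much must be generated before the first audio can be emitted but leave the model less future context per chunk, which is why the chunk size counts as a tunable parameter rather than a fixed design fact.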
axioms (2)
- domain assumption: Discrete speech tokens from supervised training capture sufficient prosody and content for high-quality synthesis.
- domain assumption: Pre-trained LLM weights transfer effectively to text-to-speech token prediction without major retraining.
Forward citations
Cited by 28 Pith papers
- AffectCodec: Emotion-Preserving Neural Speech Codec for Expressive Speech Modeling
AffectCodec is an emotion-guided neural speech codec that preserves emotional cues during quantization while maintaining semantic fidelity and prosodic naturalness.
- How Should LLMs Listen While Speaking? A Study of User-Stream Routing in Full-Duplex Spoken Dialogue
Channel fusion gives better semantic grounding and QA performance in full-duplex LLM dialogue but is vulnerable to context corruption during interruptions, while cross-attention routing is more robust at the cost of w...
- Kinetic-Optimal Scheduling with Moment Correction for Metric-Induced Discrete Flow Matching in Zero-Shot Text-to-Speech
GibbsTTS combines a training-free kinetic-optimal scheduler with finite-step moment correction in MI-DFM to deliver top naturalness and strong speaker similarity in zero-shot TTS.
- VITA-QinYu: Expressive Spoken Language Model for Role-Playing and Singing
VITA-QinYu is the first expressive end-to-end spoken language model supporting role-playing and singing alongside conversation, trained on 15.8K hours of data and outperforming prior models on expressiveness and conve...
- Tibetan-TTS: Low-Resource Tibetan Speech Synthesis with Large Model Adaptation
Large-model adaptation with Tibetan text handling produces natural speech from limited data, outperforming commercial systems.
- MINT-Bench: A Comprehensive Multilingual Benchmark for Instruction-Following Text-to-Speech
MINT-Bench is a new benchmark using hierarchical taxonomy, multi-stage data pipeline, and hybrid evaluation to assess instruction-following TTS systems, revealing major gaps in compositional and paralinguistic controls.
- NVBench: A Benchmark for Speech Synthesis with Non-Verbal Vocalizations
NVBench provides a standardized bilingual benchmark and evaluation protocol for assessing non-verbal vocalization generation, placement, and salience in text-to-speech systems.
- AST: Adaptive, Seamless, and Training-Free Precise Speech Editing
AST enables seamless speech editing by latent recomposition on pre-trained TTS models plus adaptive weak fact guidance, plus a new dataset and WDTW metric, claiming 70% WER reduction and better temporal consistency wi...
- CoSyncDiT: Cognitive Synchronous Diffusion Transformer for Movie Dubbing
CoSyncDiT is a cognitive-inspired diffusion transformer that achieves state-of-the-art lip synchronization and naturalness in movie dubbing by guiding noise-to-speech generation through acoustic, visual, and contextua...
- CapTalk: Unified Voice Design for Single-Utterance and Dialogue Speech Generation
CapTalk unifies single-utterance and dialogue voice design via utterance- and speaker-level captions plus a hierarchical variational module for stable timbre with adaptive expression.
- The Silent Thought: Modeling Internal Cognition in Full-Duplex Spoken Dialogue Models via Latent Reasoning
FLAIR enables spoken dialogue AI to conduct continuous latent reasoning while perceiving speech through recursive latent embeddings and an ELBO-based finetuning objective.
- AuDirector: A Self-Reflective Closed-Loop Framework for Immersive Audio Storytelling
AuDirector is a self-reflective closed-loop multi-agent framework that generates immersive audio narratives with improved structural coherence, emotional expressiveness, and acoustic fidelity via identity-aware voice ...
- Reducing Linguistic Hallucination in LM-Based Speech Enhancement via Noise-Invariant Acoustic-Semantic Distillation
L3-SE reduces linguistic hallucination in LM-based speech enhancement by distilling noise-invariant acoustic-semantic representations from noisy inputs to condition an autoregressive decoder-only language model.
- CASCADE: Context-Aware Relaxation for Speculative Image Decoding
CASCADE formalizes semantic interchangeability and convergence in target model representations to enable context-aware acceptance relaxation in tree-based speculative decoding, delivering up to 3.6x speedup on text-to...
- TTS-PRISM: A Perceptual Reasoning and Interpretable Speech Model for Fine-Grained Diagnosis
TTS-PRISM defines a 12-dimensional perceptual schema, builds a targeted diagnostic dataset via adversarial synthesis and expert labels, and tunes an end-to-end model that outperforms generalist LLMs in human alignment...
- Why Your Tokenizer Fails in Information Fusion: A Timing-Aware Pre-Quantization Fusion for Video-Enhanced Audio Tokenization
A timing-aware pre-quantization fusion approach integrates visual cues into audio tokenizers along the temporal axis, maintaining reconstruction quality while outperforming audio-only and prior multimodal baselines on...
- Bridging What the Model Thinks and How It Speaks: Self-Aware Speech Language Models for Expressive Speech Generation
SA-SLM uses variational information bottleneck for intent-aware bridging and self-criticism for realization-aware alignment to close the semantic-acoustic gap, outperforming open-source models and nearing GPT-4o-Audio...
- ASPIRin: Action Space Projection for Interactivity-Optimized Reinforcement Learning in Full-Duplex Speech Language Models
ASPIRin decouples speaking timing from token content via binary action space projection and applies GRPO with rule-based rewards to optimize interactivity in SLMs without semantic collapse or repetition.
- OmniVoice: Towards Omnilingual Zero-Shot Text-to-Speech with Diffusion Language Models
OmniVoice introduces a diffusion language model-style non-autoregressive TTS system that directly maps text to multi-codebook acoustic tokens, scaling zero-shot synthesis to over 600 languages with SOTA results on mul...
- FoleyDirector: Fine-Grained Temporal Steering for Video-to-Audio Generation via Structured Scripts
FoleyDirector introduces structured temporal scripts and a fusion module to enable precise timing control in DiT-based video-to-audio generation while preserving audio fidelity.
- Borderless Long Speech Synthesis
Borderless Long Speech Synthesis unifies voice design, multi-speaker TTS, and long-form generation via Global-Sentence-Token annotations, CoT reasoning, and a Structured Semantic Interface for agent-centric control.
- Qwen3-Omni Technical Report
Qwen3-Omni is a unified multimodal model that achieves open-source SOTA on 32 of 36 audio and audio-visual benchmarks and overall SOTA on 22 without degrading performance on text, image, or video relative to single-mo...
- Sema: Semantic Transport for Real-Time Multimodal Agents
Sema reduces uplink bandwidth by 64x for audio and 130-210x for screenshots while keeping multimodal agent task accuracy within 0.7 percentage points of raw baselines in WAN simulations.
- Qwen3.5-Omni Technical Report
Qwen3.5-Omni scales an omnimodal model to hundreds of billions of parameters with 256k context, introduces ARIA for stable speech synthesis, and reports SOTA performance on 215 audio-visual benchmarks while adding mul...
- Controllable Singing Style Conversion with Boundary-Aware Information Bottleneck
A singing voice conversion system with boundary-aware information bottleneck and high-frequency augmentation achieves the best naturalness in SVCC2025 subjective tests while using less extra data than competitors.
- WAND: Windowed Attention and Knowledge Distillation for Efficient Autoregressive Text-to-Speech Models
WAND adapts AR-TTS models to constant complexity via windowed attention and distillation, cutting KV cache memory by up to 66.2% while preserving quality and achieving length-invariant latency.
- Qwen2.5-Omni Technical Report
Qwen2.5-Omni presents a multimodal model with block-wise encoders, TMRoPE position embeddings, and a Thinker-Talker architecture that enables simultaneous text and streaming speech generation while matching text perfo...
- Empowering Video Translation using Multimodal Large Language Models
The paper offers the first focused review of MLLM-based video translation organized by a three-role taxonomy of Semantic Reasoner, Expressive Performer, and Visual Synthesizer, plus open challenges.