
CosyVoice 3: Towards In-the-wild Speech Generation via Scaling-up and Post-training

Zhihao Du, Changfeng Gao, Yuxuan Wang, Fan Yu, Tianyu Zhao, Hao Wang · 2025 · cs.SD · arXiv 2505.17589

Mixed citation behavior. Most common role is background (33%).

42 Pith papers citing it

Background 33% of classified citations

abstract

In our prior works, we introduced a scalable streaming speech synthesis model, CosyVoice 2, which integrates a large language model (LLM) and a chunk-aware flow matching (FM) model, and achieves low-latency bi-streaming speech synthesis and human-parity quality. Despite these advancements, CosyVoice 2 exhibits limitations in language coverage, domain diversity, data volume, text formats, and post-training techniques. In this paper, we present CosyVoice 3, an improved model designed for zero-shot multilingual speech synthesis in the wild, surpassing its predecessor in content consistency, speaker similarity, and prosody naturalness. Key features of CosyVoice 3 include: 1) A novel speech tokenizer to improve prosody naturalness, developed via supervised multi-task training, including automatic speech recognition, speech emotion recognition, language identification, audio event detection, and speaker analysis. 2) A new differentiable reward model for post-training applicable not only to CosyVoice 3 but also to other LLM-based speech synthesis models. 3) Dataset Size Scaling: Training data is expanded from ten thousand hours to one million hours, encompassing 9 languages and 18 Chinese dialects across various domains and text formats. 4) Model Size Scaling: Model parameters are increased from 0.5 billion to 1.5 billion, resulting in enhanced performance on our multilingual benchmark due to the larger model capacity. These advancements contribute significantly to the progress of speech synthesis in the wild. We encourage readers to listen to the demo at https://funaudiollm.github.io/cosyvoice3.

citation-role summary

background 2 · method 2 · baseline 1 · dataset 1

citation-polarity summary

background 2 · use method 2 · baseline 1 · use dataset 1

representative citing papers

WavTTS: Towards High-Quality Zero-Shot TTS via Direct Raw Waveform Modeling

eess.AS · 2026-06-02 · unverdicted · novelty 8.0

WavTTS is the first raw-waveform diffusion TTS model using DiT flow matching and multi-scale mel supervision that approaches SOTA latent zero-shot performance while beating prior end-to-end models.

VoxSafeBench: Not Just What Is Said, but Who, How, and Where

cs.SD · 2026-04-16 · unverdicted · novelty 8.0

VoxSafeBench reveals that speech language models recognize social norms from text but fail to apply them when acoustic cues like speaker or scene determine the appropriate response.

FlexiSLM: A Dynamic and Controllable Frame Rate Spoken Language Model

cs.SD · 2026-06-30 · unverdicted · novelty 7.0

FlexiSLM is the first spoken language model supporting dynamic and controllable frame rates on speech input and output, outperforming fixed-rate 7B models at high quality and enabling faster inference at lower rates like 6.25 Hz.

PolySpeech-100: A Large-Scale Benchmark for Speech Understanding Across 100+ Languages and Dialects

cs.CL · 2026-05-31 · unverdicted · novelty 7.0

PolySpeech-100 is a new benchmark for native-level speech comprehension across 110 linguistic variants that evaluates 22 models and reports E2E advantages on dialects, robustness gaps on low-resource languages, and degradation from Chain-of-Thought prompting.

Unified Synthesis of Compositional Speech and Sound from Free-Form Text Prompts

cs.SD · 2026-05-27 · unverdicted · novelty 7.0

PlanAudio introduces a unified autoregressive LLM framework with semantic latent chain-of-thought for generating composite speech and sound audio from free-form text, plus a new benchmark.

Kinetic-Optimal Scheduling with Moment Correction for Metric-Induced Discrete Flow Matching in Zero-Shot Text-to-Speech

eess.AS · 2026-05-10 · unverdicted · novelty 7.0

GibbsTTS combines a training-free kinetic-optimal scheduler with finite-step moment correction in MI-DFM to deliver top naturalness and strong speaker similarity in zero-shot TTS.

VITA-QinYu: Expressive Spoken Language Model for Role-Playing and Singing

cs.CL · 2026-05-07 · unverdicted · novelty 7.0

VITA-QinYu is the first expressive end-to-end spoken language model supporting role-playing and singing alongside conversation, trained on 15.8K hours of data and outperforming prior models on expressiveness and conversational benchmarks.

Toward Fine-Grained Speech Inpainting Forensics: A Dataset, Method, and Metric for Multi-Region Tampering Localization

cs.SD · 2026-05-04 · unverdicted · novelty 7.0

A new dataset, iterative coarse-to-fine localization framework, and segment-level IoU F1 metric tackle the open problem of detecting multiple unknown word-level inpainted regions in speech.

SPG-Codec: Exploring the Role and Boundaries of Semantic Priors in Ultra-Low-Bitrate Neural Speech Coding

eess.AS · 2026-04-29 · unverdicted · novelty 7.0

Semantic priors from HuBERT and Whisper improve speech codec intelligibility up to 6 kbps but show diminishing returns beyond that, with a bitrate-aware regulation strategy balancing semantic consistency and naturalness.

AST: Adaptive, Seamless, and Training-Free Precise Speech Editing

cs.SD · 2026-04-17 · unverdicted · novelty 7.0

AST enables seamless speech editing by latent recomposition on pre-trained TTS models plus adaptive weak fact guidance, plus a new dataset and WDTW metric, claiming 70% WER reduction and better temporal consistency without training.

From Reactive to Proactive: Assessing the Proactivity of Voice Agents via ProVoice-Bench

cs.AI · 2026-04-16 · unverdicted · novelty 7.0

ProVoice-Bench is the first framework to evaluate proactive voice agents, revealing that state-of-the-art multimodal LLMs struggle with over-triggering and context-aware reasoning.

Knowing What to Stress: A Discourse-Conditioned Text-to-Speech Benchmark

cs.CL · 2026-04-12 · unverdicted · novelty 7.0

CAST benchmark shows language models infer correct word stress from discourse context but TTS systems frequently fail to produce it in speech.

SQuTR: A Robustness Benchmark for Spoken Query to Text Retrieval under Acoustic Noise

cs.IR · 2026-02-13 · unverdicted · novelty 7.0

SQuTR aggregates 37k queries from six text retrieval datasets, synthesizes speech from 200 speakers, adds 17 noise categories at varying SNR, and shows that even large retrieval models degrade sharply under extreme acoustic noise.

ViBES: A Conversational Agent with Behaviorally-Intelligent 3D Virtual Body

cs.CV · 2025-12-16 · unverdicted · novelty 7.0

ViBES introduces a speech-language-behavior model using modality-specific transformer experts that jointly generates dialogue and 3D body actions, showing gains over separate co-speech and text-to-motion baselines on multi-turn metrics.

Preserving Speech-to-Text LLM Capabilities in Speech-to-Speech Generation

eess.AS · 2026-06-29 · unverdicted · novelty 6.0

PRIME-Speech adds low-latency speech output to frozen S2T LLMs by synchronizing a causal post-decoder with intermediate hidden states and using mixed conditioning plus turn-level KV-cache packing, preserving original S2T performance across translation, QA, and dialogue tasks.

HPRO: Hierarchical Progressive Reward Optimization via Preference Extraction for Emotional Text-to-Speech

eess.AS · 2026-06-26 · unverdicted · novelty 6.0

HPRO uses a differentiable HD-Emo codec to extract separate content and style tokens and progressively aligns frame-, word-, and sentence-level rewards to improve emotional expressiveness in TTS while preserving intelligibility.

Read What You Hear: Reference-Free Hypotheses Evaluation with Acoustic Discrepancy

eess.AS · 2026-06-03 · unverdicted · novelty 6.0

READ is a reference-free ASR hypothesis scorer that measures acoustic discrepancy via conditional likelihood from a pretrained auto-regressive TTS model and yields up to 20% relative error rate reduction when used for refinement.

Foley-Omni: A Unified Multimodal Generation Model from Task-Level Audio Synthesis to Complete Video Soundtrack Generation

cs.SD · 2026-06-02 · unverdicted · novelty 6.0

Foley-Omni extends isolated audio synthesis to joint generation of full video soundtracks across speech, effects, and music, with a new V2ST-Bench for evaluation showing competitive single-task results and gains in mixed-track consistency.

Benchmarking Speech-to-Speech Translation Models

cs.CL · 2026-06-02 · unverdicted · novelty 6.0

COMPASS is a new reproducible benchmarking framework for S2ST that deploys 46 metrics on 1248 configurations, shows single-metric rankings mislead, reduces to 10 metrics per direction, and finds domain-specific metrics better match human judgments than standalone MOS predictors.

LaSR: Context-Aware Speech Recognition via Latent Reasoning

cs.CL · 2026-05-30 · unverdicted · novelty 6.0

LaSR improves context-aware terminology recognition in speech LLMs by aligning latent CoT supervision on acoustic regions and introducing latent reasoning periods, shown on a new academic corpus to outperform standard fine-tuning without added latency.

SwanVoice: Expressive Long-Form Zero-Shot Speech Synthesis for Both Monologue and Dialogue

eess.AS · 2026-05-29 · unverdicted · novelty 6.0

SwanVoice is a zero-shot TTS system for 1-4 speakers that reports higher richness and hierarchy scores than open-source baselines on monologue and dialogue tasks via mixed training and DiffusionNFT post-training.

Hidden in Plain Tokens: Simply Robust, Gradient-Free Watermark for Synthetic Audio

cs.LG · 2026-05-25 · unverdicted · novelty 6.0

A training-free audio watermarking method that reduces vocabulary via community detection to boost detection robustness by orders of magnitude while resisting audio modifications.

RobustSpeechFlow: Learning Robust Text-to-Speech Trajectories via Augmentation-based Contrastive Flow Matching

cs.SD · 2026-05-21 · unverdicted · novelty 6.0

RobustSpeechFlow improves TTS alignment robustness by extending contrastive flow matching with length-preserving repeat and skip latent augmentations, lowering WER from 1.44 to 1.38 on Seed-TTS-eval and CER on ZERO500.

SemaVoice: Semantic-Aware Continuous Autoregressive Speech Synthesis

eess.AS · 2026-05-16 · unverdicted · novelty 6.0

SemaVoice adds SFM-guided alignment to refine continuous speech representations in autoregressive TTS, reporting 1.71% English WER on Seed-TTS and competitiveness with open-source SOTA.
