WavTTS is the first raw-waveform diffusion TTS model using DiT flow matching and multi-scale mel supervision that approaches SOTA latent zero-shot performance while beating prior end-to-end models.
hub Mixed citations
CosyVoice 3: Towards In-the-wild Speech Generation via Scaling-up and Post-training
Mixed citation behavior. Most common role is background (33%).
abstract
In our prior works, we introduced a scalable streaming speech synthesis model, CosyVoice 2, which integrates a large language model (LLM) and a chunk-aware flow matching (FM) model, and achieves low-latency bi-streaming speech synthesis and human-parity quality. Despite these advancements, CosyVoice 2 exhibits limitations in language coverage, domain diversity, data volume, text formats, and post-training techniques. In this paper, we present CosyVoice 3, an improved model designed for zero-shot multilingual speech synthesis in the wild, surpassing its predecessor in content consistency, speaker similarity, and prosody naturalness. Key features of CosyVoice 3 include: 1) A novel speech tokenizer to improve prosody naturalness, developed via supervised multi-task training, including automatic speech recognition, speech emotion recognition, language identification, audio event detection, and speaker analysis. 2) A new differentiable reward model for post-training applicable not only to CosyVoice 3 but also to other LLM-based speech synthesis models. 3) Dataset Size Scaling: Training data is expanded from ten thousand hours to one million hours, encompassing 9 languages and 18 Chinese dialects across various domains and text formats. 4) Model Size Scaling: Model parameters are increased from 0.5 billion to 1.5 billion, resulting in enhanced performance on our multilingual benchmark due to the larger model capacity. These advancements contribute significantly to the progress of speech synthesis in the wild. We encourage readers to listen to the demo at https://funaudiollm.github.io/cosyvoice3.
hub tools
citation-role summary
citation-polarity summary
verdicts
UNVERDICTED 57representative citing papers
VoxSafeBench reveals that speech language models recognize social norms from text but fail to apply them when acoustic cues like speaker or scene determine the appropriate response.
FlexiSLM is the first spoken language model supporting dynamic and controllable frame rates on speech input and output, outperforming fixed-rate 7B models at high quality and enabling faster inference at lower rates like 6.25 Hz.
AudioCALM presents a continuous autoregressive framework with flow-matching prediction and A-MoME architecture that unifies speech, sound, and music generation while matching modality-specific state-of-the-art performance.
Bagpiper-TTS uses natural language prompts and intent reasoning to derive rich captions that guide a single model for universal speech synthesis across classical TTS, multi-talker, singing, and role-play tasks.
Introduces the first benchmark for over-refusal in large audio language models using 3,000 pseudo-harmful audio samples and evaluates 12 models across six families, finding widespread over-refusal.
PolySpeech-100 is a new benchmark for native-level speech comprehension across 110 linguistic variants that evaluates 22 models and reports E2E advantages on dialects, robustness gaps on low-resource languages, and degradation from Chain-of-Thought prompting.
PlanAudio introduces a unified autoregressive LLM framework with semantic latent chain-of-thought for generating composite speech and sound audio from free-form text, plus a new benchmark.
GibbsTTS combines a training-free kinetic-optimal scheduler with finite-step moment correction in MI-DFM to deliver top naturalness and strong speaker similarity in zero-shot TTS.
VITA-QinYu is the first expressive end-to-end spoken language model supporting role-playing and singing alongside conversation, trained on 15.8K hours of data and outperforming prior models on expressiveness and conversational benchmarks.
A new dataset, iterative coarse-to-fine localization framework, and segment-level IoU F1 metric tackle the open problem of detecting multiple unknown word-level inpainted regions in speech.
Semantic priors from HuBERT and Whisper improve speech codec intelligibility up to 6 kbps but show diminishing returns beyond that, with a bitrate-aware regulation strategy balancing semantic consistency and naturalness.
AST enables seamless speech editing by latent recomposition on pre-trained TTS models plus adaptive weak fact guidance, plus a new dataset and WDTW metric, claiming 70% WER reduction and better temporal consistency without training.
ProVoice-Bench is the first framework to evaluate proactive voice agents, revealing that state-of-the-art multimodal LLMs struggle with over-triggering and context-aware reasoning.
CAST benchmark shows language models infer correct word stress from discourse context but TTS systems frequently fail to produce it in speech.
SQuTR aggregates 37k queries from six text retrieval datasets, synthesizes speech from 200 speakers, adds 17 noise categories at varying SNR, and shows that even large retrieval models degrade sharply under extreme acoustic noise.
ViBES introduces a speech-language-behavior model using modality-specific transformer experts that jointly generates dialogue and 3D body actions, showing gains over separate co-speech and text-to-motion baselines on multi-turn metrics.
PRIME-Speech adds low-latency speech output to frozen S2T LLMs by synchronizing a causal post-decoder with intermediate hidden states and using mixed conditioning plus turn-level KV-cache packing, preserving original S2T performance across translation, QA, and dialogue tasks.
HPRO uses a differentiable HD-Emo codec to extract separate content and style tokens and progressively aligns frame-, word-, and sentence-level rewards to improve emotional expressiveness in TTS while preserving intelligibility.
ProsoCodec models prosody as a conditional residual in a speech codec via text and speaker prefix conditioning, yielding improved prosody preservation and less timbre leakage in voice conversion experiments.
Bagpiper-Edit performs zero-shot open-ended audio editing by translating natural-language instructions into edited rich captions that guide generation anchored to the original audio.
A MoE-enhanced model with conditional distillation reduces speech-NVV EER from 38.93% to 22.66% and speech EER from 13.17% to 9.24% across 10 NVV types.
DeSRPA introduces a dual-level control vector method for inference-time intervention on frozen backbones to improve personality consistency and speech naturalness in role-playing agents over end-to-end fine-tuned baselines.
Joycent uses diffusion modeling and conditional layer normalization to synthesize accented speech from standard phones and references, claiming better accentedness and speaker preservation than two-stage baselines.
citing papers explorer
-
WavTTS: Towards High-Quality Zero-Shot TTS via Direct Raw Waveform Modeling
WavTTS is the first raw-waveform diffusion TTS model using DiT flow matching and multi-scale mel supervision that approaches SOTA latent zero-shot performance while beating prior end-to-end models.
-
VoxSafeBench: Not Just What Is Said, but Who, How, and Where
VoxSafeBench reveals that speech language models recognize social norms from text but fail to apply them when acoustic cues like speaker or scene determine the appropriate response.
-
FlexiSLM: A Dynamic and Controllable Frame Rate Spoken Language Model
FlexiSLM is the first spoken language model supporting dynamic and controllable frame rates on speech input and output, outperforming fixed-rate 7B models at high quality and enabling faster inference at lower rates like 6.25 Hz.
-
AudioCALM: Continuous Autoregressive Language Modeling for Universal Audio Generation
AudioCALM presents a continuous autoregressive framework with flow-matching prediction and A-MoME architecture that unifies speech, sound, and music generation while matching modality-specific state-of-the-art performance.
-
Bagpiper-TTS: Natural Language Guided Universal Speech Synthesis
Bagpiper-TTS uses natural language prompts and intent reasoning to derive rich captions that guide a single model for universal speech synthesis across classical TTS, multi-talker, singing, and role-play tasks.
-
AOR-Bench: Do Large Audio Language Models Over-Refuse Pseudo-Harmful Queries?
Introduces the first benchmark for over-refusal in large audio language models using 3,000 pseudo-harmful audio samples and evaluates 12 models across six families, finding widespread over-refusal.
-
PolySpeech-100: A Large-Scale Benchmark for Speech Understanding Across 100+ Languages and Dialects
PolySpeech-100 is a new benchmark for native-level speech comprehension across 110 linguistic variants that evaluates 22 models and reports E2E advantages on dialects, robustness gaps on low-resource languages, and degradation from Chain-of-Thought prompting.
-
Unified Synthesis of Compositional Speech and Sound from Free-Form Text Prompts
PlanAudio introduces a unified autoregressive LLM framework with semantic latent chain-of-thought for generating composite speech and sound audio from free-form text, plus a new benchmark.
-
Kinetic-Optimal Scheduling with Moment Correction for Metric-Induced Discrete Flow Matching in Zero-Shot Text-to-Speech
GibbsTTS combines a training-free kinetic-optimal scheduler with finite-step moment correction in MI-DFM to deliver top naturalness and strong speaker similarity in zero-shot TTS.
-
VITA-QinYu: Expressive Spoken Language Model for Role-Playing and Singing
VITA-QinYu is the first expressive end-to-end spoken language model supporting role-playing and singing alongside conversation, trained on 15.8K hours of data and outperforming prior models on expressiveness and conversational benchmarks.
-
Toward Fine-Grained Speech Inpainting Forensics:A Dataset, Method, and Metric for Multi-Region Tampering Localization
A new dataset, iterative coarse-to-fine localization framework, and segment-level IoU F1 metric tackle the open problem of detecting multiple unknown word-level inpainted regions in speech.
-
SPG-Codec: Exploring the Role and Boundaries of Semantic Priors in Ultra-Low-Bitrate Neural Speech Coding
Semantic priors from HuBERT and Whisper improve speech codec intelligibility up to 6 kbps but show diminishing returns beyond that, with a bitrate-aware regulation strategy balancing semantic consistency and naturalness.
-
AST: Adaptive, Seamless, and Training-Free Precise Speech Editing
AST enables seamless speech editing by latent recomposition on pre-trained TTS models plus adaptive weak fact guidance, plus a new dataset and WDTW metric, claiming 70% WER reduction and better temporal consistency without training.
-
From Reactive to Proactive: Assessing the Proactivity of Voice Agents via ProVoice-Bench
ProVoice-Bench is the first framework to evaluate proactive voice agents, revealing that state-of-the-art multimodal LLMs struggle with over-triggering and context-aware reasoning.
-
Knowing What to Stress: A Discourse-Conditioned Text-to-Speech Benchmark
CAST benchmark shows language models infer correct word stress from discourse context but TTS systems frequently fail to produce it in speech.
-
SQuTR: A Robustness Benchmark for Spoken Query to Text Retrieval under Acoustic Noise
SQuTR aggregates 37k queries from six text retrieval datasets, synthesizes speech from 200 speakers, adds 17 noise categories at varying SNR, and shows that even large retrieval models degrade sharply under extreme acoustic noise.
-
Preserving Speech-to-Text LLM Capabilities in Speech-to-Speech Generation
PRIME-Speech adds low-latency speech output to frozen S2T LLMs by synchronizing a causal post-decoder with intermediate hidden states and using mixed conditioning plus turn-level KV-cache packing, preserving original S2T performance across translation, QA, and dialogue tasks.
-
HPRO: Hierarchical Progressive Reward Optimization via Preference Extraction for Emotional Text-to-Speech
HPRO uses a differentiable HD-Emo codec to extract separate content and style tokens and progressively aligns frame-, word-, and sentence-level rewards to improve emotional expressiveness in TTS while preserving intelligibility.
-
ProsoCodec: Prosody-Oriented Speech Codec for Voice Conversion
ProsoCodec models prosody as a conditional residual in a speech codec via text and speaker prefix conditioning, yielding improved prosody preservation and less timbre leakage in voice conversion experiments.
-
Bagpiper-Edit: Zero-Shot Open-Ended Audio Editing via Rich-Caption
Bagpiper-Edit performs zero-shot open-ended audio editing by translating natural-language instructions into edited rich captions that guide generation anchored to the original audio.
-
Speaker Identity in Non-Verbal Vocalizations: Conditional Distillation and Mixture of Experts Approach
A MoE-enhanced model with conditional distillation reduces speech-NVV EER from 38.93% to 22.66% and speech EER from 13.17% to 9.24% across 10 NVV types.
-
DeSRPA: Decoupled Speech Role-Playing Agent via Inference-Time Intervention
DeSRPA introduces a dual-level control vector method for inference-time intervention on frozen backbones to improve personality consistency and speech naturalness in role-playing agents over end-to-end fine-tuned baselines.
-
Joycent: Diffusion-based Accent TTS without Accented Phone Prediction
Joycent uses diffusion modeling and conditional layer normalization to synthesize accented speech from standard phones and references, claiming better accentedness and speaker preservation than two-stage baselines.
-
EmoInstruct-TTS: Dual-Path Instruction-Guided Emotional Speech Synthesis
EmoInstruct-TTS uses Emotion2embed and an Instruction-Conditioned Emotion Flow Model (ICE-Flow) to generate acoustically grounded emotion representations from free-form instructions and integrate them into an LLM-based TTS pipeline.
-
TLDR: Compressing Audio Tokens for Efficient Autoregressive Text-to-Speech
TLDR groups codec tokens into patches for patch-level autoregressive modeling in pretrained TTS systems, yielding 1.8x speedup and 75% KV-cache reduction at patch size 4.
-
dots.tts Technical Report
dots.tts reports SOTA benchmark results on Seed-TTS-Eval and other tests via continuous latent-space autoregressive modeling with three listed innovations and code release.
-
Read What You Hear: Reference-Free Hypotheses Evaluation with Acoustic Discrepancy
READ is a reference-free ASR hypothesis scorer that measures acoustic discrepancy via conditional likelihood from a pretrained auto-regressive TTS model and yields up to 20% relative error rate reduction when used for refinement.
-
Foley-Omni: A Unified Multimodal Generation Model from Task-Level Audio Synthesis to Complete Video Soundtrack Generation
Foley-Omni extends isolated audio synthesis to joint generation of full video soundtracks across speech, effects, and music, with a new V2ST-Bench for evaluation showing competitive single-task results and gains in mixed-track consistency.
-
Benchmarking Speech-to-Speech Translation Models
COMPASS is a new reproducible benchmarking framework for S2ST that deploys 46 metrics on 1248 configurations, shows single-metric rankings mislead, reduces to 10 metrics per direction, and finds domain-specific metrics better match human judgments than standalone MOS predictors.
-
LaSR: Context-Aware Speech Recognition via Latent Reasoning
LaSR improves context-aware terminology recognition in speech LLMs by aligning latent CoT supervision on acoustic regions and introducing latent reasoning periods, shown on a new academic corpus to outperform standard fine-tuning without added latency.
-
SwanVoice: Expressive Long-Form Zero-Shot Speech Synthesis for Both Monologue and Dialogue
SwanVoice is a zero-shot TTS system for 1-4 speakers that reports higher richness and hierarchy scores than open-source baselines on monologue and dialogue tasks via mixed training and DiffusionNFT post-training.
-
Hidden in Plain Tokens: Simply Robust, Gradient-Free Watermark for Synthetic Audio
A training-free audio watermarking method that reduces vocabulary via community detection to boost detection robustness by orders of magnitude while resisting audio modifications.
-
RobustSpeechFlow: Learning Robust Text-to-Speech Trajectories via Augmentation-based Contrastive Flow Matching
RobustSpeechFlow improves TTS alignment robustness by extending contrastive flow matching with length-preserving repeat and skip latent augmentations, lowering WER from 1.44 to 1.38 on Seed-TTS-eval and CER on ZERO500.
-
SemaVoice: Semantic-Aware Continuous Autoregressive Speech Synthesis
SemaVoice adds SFM-guided alignment to refine continuous speech representations in autoregressive TTS, reporting 1.71% English WER on Seed-TTS and competitiveness with open-source SOTA.
-
From Flat Language Labels to Typological Priors: Structured Language Conditioning for Multilingual Speech-to-Speech Translation
S2ST-Omni 2 uses typology-informed hierarchical encoding, gated Dual-CTC, and typology-aware prompting to improve multilingual S2ST over flat-label baselines on CVSS-C, with gains in low-data regimes.
-
TTS-PRISM: A Perceptual Reasoning and Interpretable Speech Model for Fine-Grained Diagnosis
TTS-PRISM defines a 12-dimensional perceptual schema, builds a targeted diagnostic dataset via adversarial synthesis and expert labels, and tunes an end-to-end model that outperforms generalist LLMs in human alignment on a 1,600-sample Mandarin test set while profiling six TTS paradigms.
-
UAF: A Unified Audio Front-end LLM for Full-Duplex Speech Interaction
UAF is the first unified audio front-end LLM that turns multiple front-end tasks into one sequence prediction model processing streaming audio chunks and reference prompts to output semantic and control tokens for full-duplex interaction.
-
MoVE: Translating Laughter and Tears via Mixture of Vocalization Experts in Speech-to-Speech Translation
MoVE uses specialized LoRA expert adapters and a soft router to translate non-verbal vocalizations in S2ST, reproducing them in 76% of cases versus at most 14% for baselines while scoring highest on naturalness and emotional fidelity.
-
Audio2Tool: Speak, Call, Act -- A Dataset for Benchmarking Speech Tool Use
Audio2Tool is a new benchmark dataset that shows speech models perform well on simple commands but degrade sharply on compositional tasks and realistic acoustic noise.
-
ASPIRin: Action Space Projection for Interactivity-Optimized Reinforcement Learning in Full-Duplex Speech Language Models
ASPIRin decouples speaking timing from token content via binary action space projection and applies GRPO with rule-based rewards to optimize interactivity in SLMs without semantic collapse or repetition.
-
OmniVoice: Towards Omnilingual Zero-Shot Text-to-Speech with Diffusion Language Models
OmniVoice introduces a diffusion language model-style non-autoregressive TTS system that directly maps text to multi-codebook acoustic tokens, scaling zero-shot synthesis to over 600 languages with SOTA results on multilingual benchmarks using 581k hours of open data.
-
FoleyDirector: Fine-Grained Temporal Steering for Video-to-Audio Generation via Structured Scripts
FoleyDirector introduces structured temporal scripts and a fusion module to enable precise timing control in DiT-based video-to-audio generation while preserving audio fidelity.
-
AMR: Adaptive Modality Routing for Multimodal Polyglot Speaker Identification
AMR dynamically routes audio (W2V-BERT 2.0) and face (IResNet-18) embeddings via adapters and a KL-supervised router, reaching 99.07% average accuracy on POLY-SIM 2026 protocols and beating the FOP baseline by 32.73%.
-
FlowTTS-GRPO: Online Reinforcement Learning with Multi-Objective Reward Optimization for Flow-Matching Based Text-to-Speech
FlowTTS-GRPO applies online RL with weighted multi-objective rewards to flow-matching TTS models via ODE-to-SDE conversion, reporting gains in speaker similarity and perceptual quality on CosyVoice 3.0 and F5-TTS.
-
End-to-End Training for Discrete Token LLM based TTS System
An end-to-end optimization framework jointly trains the speech tokenizer, LLM, FM model, and reward model for discrete-token TTS, reporting new SOTA WER of 0.78% and 1.56% on Seed-TTS-Eval with 0.6B LLM and 0.5B FM.
-
ImmersiveTTS: Environment-Aware Text-to-Speech with Multimodal Diffusion Transformer and Domain-Specific Representation Alignment
ImmersiveTTS proposes an environment-aware TTS system that integrates speech with environmental audio via multimodal diffusion transformer, joint attention, and domain-specific representation alignment, claiming superior naturalness and fidelity.
-
Raon-OpenTTS: Open Models and Data for Robust Text-to-Speech
Raon-OpenTTS provides an open 510K-hour curated speech dataset and DiT-based TTS models up to 1B parameters that achieve competitive WER and speaker similarity on benchmarks versus closed models trained on millions of hours.
-
Omni-Customizer: End-to-End MultiModal Customization for Joint Audio-Video Generation
Omni-Customizer proposes an end-to-end framework using Omni-Context Fusion, Masked TTS Cross-Attention, Semantic-Anchored Multimodal RoPE, and specialized training curricula to achieve precise multimodal identity binding in joint audio-video generation.
-
AgentSteerTTS: A Multi-Agent Closed-Loop Framework for Composite-Instruction Text-to-Speech
AgentSteerTTS proposes a multi-agent framework with adversarial disentanglement, dual-stream anchoring via acoustic prototypes, and fast-slow feedback to achieve intent-faithful expressive TTS for composite instructions.
-
Enhancing Flow Matching with A Unified Guidance Framework for Efficient and Robust Speech Synthesis
Unified guidance framework for Flow Matching speech synthesis achieves nearly 3x faster inference and improved speaker similarity by combining heterogeneous data augmentation with intrinsic model guidance to eliminate CFG overhead.