hub Mixed citations

CosyVoice 3: Towards In-the-wild Speech Generation via Scaling-up and Post-training

Zhihao Du, Changfeng Gao, Yuxuan Wang, Fan Yu, Tianyu Zhao, Hao Wang · 2025 · cs.SD · arXiv 2505.17589

Mixed citation behavior. Most common role is background (33%).

57 Pith papers citing it

abstract

In our prior work, we introduced a scalable streaming speech synthesis model, CosyVoice 2, which integrates a large language model (LLM) and a chunk-aware flow matching (FM) model, and achieves low-latency bi-streaming speech synthesis and human-parity quality. Despite these advancements, CosyVoice 2 exhibits limitations in language coverage, domain diversity, data volume, text formats, and post-training techniques. In this paper, we present CosyVoice 3, an improved model designed for zero-shot multilingual speech synthesis in the wild, surpassing its predecessor in content consistency, speaker similarity, and prosody naturalness. Key features of CosyVoice 3 include: 1) A novel speech tokenizer to improve prosody naturalness, developed via supervised multi-task training, including automatic speech recognition, speech emotion recognition, language identification, audio event detection, and speaker analysis. 2) A new differentiable reward model for post-training applicable not only to CosyVoice 3 but also to other LLM-based speech synthesis models. 3) Dataset Size Scaling: Training data is expanded from ten thousand hours to one million hours, encompassing 9 languages and 18 Chinese dialects across various domains and text formats. 4) Model Size Scaling: Model parameters are increased from 0.5 billion to 1.5 billion, resulting in enhanced performance on our multilingual benchmark due to the larger model capacity. These advancements contribute significantly to the progress of speech synthesis in the wild. We encourage readers to listen to the demo at https://funaudiollm.github.io/cosyvoice3.

citation-role summary

background: 2 · method: 2 · baseline: 1 · dataset: 1
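The 33% background share quoted near the top of this page is consistent with these counts: 2 background citations out of 6 classified ones. The following is a minimal Python sketch of that arithmetic, assuming the share is simply a role's count divided by the classified total and rounded to a whole percent. The function name `role_shares` is hypothetical and is not part of any Pith tooling.

```python
def role_shares(counts):
    """Return each role's share of classified citations as a whole-number percentage."""
    total = sum(counts.values())
    if total == 0:
        # No classified citations: every listed role gets a 0% share.
        return {role: 0 for role in counts}
    return {role: round(100 * n / total) for role, n in counts.items()}


# Counts copied from the citation-role summary above.
counts = {"background": 2, "method": 2, "baseline": 1, "dataset": 1}
shares = role_shares(counts)
print(shares)
```

Under this assumption, background and method tie at the same share, so no single role dominates the classified citations.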

citation-polarity summary

background: 2 · use method: 2 · baseline: 1 · use dataset: 1

representative citing papers

WavTTS: Towards High-Quality Zero-Shot TTS via Direct Raw Waveform Modeling

eess.AS · 2026-06-02 · unverdicted · novelty 8.0

WavTTS is the first raw-waveform diffusion TTS model using DiT flow matching and multi-scale mel supervision that approaches SOTA latent zero-shot performance while beating prior end-to-end models.

VoxSafeBench: Not Just What Is Said, but Who, How, and Where

cs.SD · 2026-04-16 · unverdicted · novelty 8.0

VoxSafeBench reveals that speech language models recognize social norms from text but fail to apply them when acoustic cues like speaker or scene determine the appropriate response.

FlexiSLM: A Dynamic and Controllable Frame Rate Spoken Language Model

cs.SD · 2026-06-30 · unverdicted · novelty 7.0

FlexiSLM is the first spoken language model supporting dynamic and controllable frame rates on speech input and output, outperforming fixed-rate 7B models at high quality and enabling faster inference at lower rates like 6.25 Hz.

AudioCALM: Continuous Autoregressive Language Modeling for Universal Audio Generation

eess.AS · 2026-06-22 · unverdicted · novelty 7.0

AudioCALM presents a continuous autoregressive framework with flow-matching prediction and A-MoME architecture that unifies speech, sound, and music generation while matching modality-specific state-of-the-art performance.

Bagpiper-TTS: Natural Language Guided Universal Speech Synthesis

cs.CL · 2026-06-22 · unverdicted · novelty 7.0

Bagpiper-TTS uses natural language prompts and intent reasoning to derive rich captions that guide a single model for universal speech synthesis across classical TTS, multi-talker, singing, and role-play tasks.

AOR-Bench: Do Large Audio Language Models Over-Refuse Pseudo-Harmful Queries?

cs.SD · 2026-06-19 · unverdicted · novelty 7.0

AOR-Bench introduces the first benchmark for over-refusal in large audio language models, using 3,000 pseudo-harmful audio samples to evaluate 12 models across six families and finding widespread over-refusal.

PolySpeech-100: A Large-Scale Benchmark for Speech Understanding Across 100+ Languages and Dialects

cs.CL · 2026-05-31 · unverdicted · novelty 7.0

PolySpeech-100 is a new benchmark for native-level speech comprehension across 110 linguistic variants that evaluates 22 models and reports E2E advantages on dialects, robustness gaps on low-resource languages, and degradation from Chain-of-Thought prompting.

Unified Synthesis of Compositional Speech and Sound from Free-Form Text Prompts

cs.SD · 2026-05-27 · unverdicted · novelty 7.0

PlanAudio introduces a unified autoregressive LLM framework with semantic latent chain-of-thought for generating composite speech and sound audio from free-form text, plus a new benchmark.

Kinetic-Optimal Scheduling with Moment Correction for Metric-Induced Discrete Flow Matching in Zero-Shot Text-to-Speech

eess.AS · 2026-05-10 · unverdicted · novelty 7.0

GibbsTTS combines a training-free kinetic-optimal scheduler with finite-step moment correction in MI-DFM to deliver top naturalness and strong speaker similarity in zero-shot TTS.

VITA-QinYu: Expressive Spoken Language Model for Role-Playing and Singing

cs.CL · 2026-05-07 · unverdicted · novelty 7.0

VITA-QinYu is the first expressive end-to-end spoken language model supporting role-playing and singing alongside conversation, trained on 15.8K hours of data and outperforming prior models on expressiveness and conversational benchmarks.

Toward Fine-Grained Speech Inpainting Forensics: A Dataset, Method, and Metric for Multi-Region Tampering Localization

cs.SD · 2026-05-04 · unverdicted · novelty 7.0

A new dataset, iterative coarse-to-fine localization framework, and segment-level IoU F1 metric tackle the open problem of detecting multiple unknown word-level inpainted regions in speech.

SPG-Codec: Exploring the Role and Boundaries of Semantic Priors in Ultra-Low-Bitrate Neural Speech Coding

eess.AS · 2026-04-29 · unverdicted · novelty 7.0

Semantic priors from HuBERT and Whisper improve speech codec intelligibility up to 6 kbps but show diminishing returns beyond that, with a bitrate-aware regulation strategy balancing semantic consistency and naturalness.

AST: Adaptive, Seamless, and Training-Free Precise Speech Editing

cs.SD · 2026-04-17 · unverdicted · novelty 7.0

AST enables seamless speech editing through latent recomposition on pre-trained TTS models and adaptive weak fact guidance. It also introduces a new dataset and a WDTW metric, and claims a 70% WER reduction and better temporal consistency without training.

From Reactive to Proactive: Assessing the Proactivity of Voice Agents via ProVoice-Bench

cs.AI · 2026-04-16 · unverdicted · novelty 7.0

ProVoice-Bench is the first framework to evaluate proactive voice agents, revealing that state-of-the-art multimodal LLMs struggle with over-triggering and context-aware reasoning.

Knowing What to Stress: A Discourse-Conditioned Text-to-Speech Benchmark

cs.CL · 2026-04-12 · unverdicted · novelty 7.0

CAST benchmark shows language models infer correct word stress from discourse context but TTS systems frequently fail to produce it in speech.

SQuTR: A Robustness Benchmark for Spoken Query to Text Retrieval under Acoustic Noise

cs.IR · 2026-02-13 · unverdicted · novelty 7.0

SQuTR aggregates 37k queries from six text retrieval datasets, synthesizes speech from 200 speakers, adds 17 noise categories at varying SNR, and shows that even large retrieval models degrade sharply under extreme acoustic noise.

ViBES: A Conversational Agent with Behaviorally-Intelligent 3D Virtual Body

cs.CV · 2025-12-16 · unverdicted · novelty 7.0

ViBES introduces a speech-language-behavior model using modality-specific transformer experts that jointly generates dialogue and 3D body actions, showing gains over separate co-speech and text-to-motion baselines on multi-turn metrics.

Preserving Speech-to-Text LLM Capabilities in Speech-to-Speech Generation

eess.AS · 2026-06-29 · unverdicted · novelty 6.0

PRIME-Speech adds low-latency speech output to frozen S2T LLMs by synchronizing a causal post-decoder with intermediate hidden states and using mixed conditioning plus turn-level KV-cache packing, preserving original S2T performance across translation, QA, and dialogue tasks.

HPRO: Hierarchical Progressive Reward Optimization via Preference Extraction for Emotional Text-to-Speech

eess.AS · 2026-06-26 · unverdicted · novelty 6.0

HPRO uses a differentiable HD-Emo codec to extract separate content and style tokens and progressively aligns frame-, word-, and sentence-level rewards to improve emotional expressiveness in TTS while preserving intelligibility.

ProsoCodec: Prosody-Oriented Speech Codec for Voice Conversion

eess.AS · 2026-06-20 · unverdicted · novelty 6.0

ProsoCodec models prosody as a conditional residual in a speech codec via text and speaker prefix conditioning, yielding improved prosody preservation and less timbre leakage in voice conversion experiments.

Bagpiper-Edit: Zero-Shot Open-Ended Audio Editing via Rich-Caption

cs.SD · 2026-06-19 · unverdicted · novelty 6.0

Bagpiper-Edit performs zero-shot open-ended audio editing by translating natural-language instructions into edited rich captions that guide generation anchored to the original audio.

Speaker Identity in Non-Verbal Vocalizations: Conditional Distillation and Mixture of Experts Approach

eess.AS · 2026-06-19 · unverdicted · novelty 6.0

An MoE-enhanced model with conditional distillation reduces speech-NVV EER from 38.93% to 22.66% and speech EER from 13.17% to 9.24% across 10 NVV types.

DeSRPA: Decoupled Speech Role-Playing Agent via Inference-Time Intervention

cs.SD · 2026-06-16 · unverdicted · novelty 6.0

DeSRPA introduces a dual-level control vector method for inference-time intervention on frozen backbones to improve personality consistency and speech naturalness in role-playing agents over end-to-end fine-tuned baselines.

Joycent: Diffusion-based Accent TTS without Accented Phone Prediction

cs.SD · 2026-06-15 · unverdicted · novelty 6.0

Joycent uses diffusion modeling and conditional layer normalization to synthesize accented speech from standard phones and references, claiming better accentedness and speaker preservation than two-stage baselines.

citing papers explorer

Showing 50 of 55 citing papers after filters.

WavTTS: Towards High-Quality Zero-Shot TTS via Direct Raw Waveform Modeling · eess.AS · 2026-06-02 · unverdicted · none · ref 16 · internal anchor
WavTTS is the first raw-waveform diffusion TTS model using DiT flow matching and multi-scale mel supervision that approaches SOTA latent zero-shot performance while beating prior end-to-end models.
VoxSafeBench: Not Just What Is Said, but Who, How, and Where · cs.SD · 2026-04-16 · unverdicted · none · ref 30 · internal anchor
VoxSafeBench reveals that speech language models recognize social norms from text but fail to apply them when acoustic cues like speaker or scene determine the appropriate response.
FlexiSLM: A Dynamic and Controllable Frame Rate Spoken Language Model · cs.SD · 2026-06-30 · unverdicted · none · ref 93 · internal anchor
FlexiSLM is the first spoken language model supporting dynamic and controllable frame rates on speech input and output, outperforming fixed-rate 7B models at high quality and enabling faster inference at lower rates like 6.25 Hz.
AudioCALM: Continuous Autoregressive Language Modeling for Universal Audio Generation · eess.AS · 2026-06-22 · unverdicted · none · ref 10 · internal anchor
AudioCALM presents a continuous autoregressive framework with flow-matching prediction and A-MoME architecture that unifies speech, sound, and music generation while matching modality-specific state-of-the-art performance.
Bagpiper-TTS: Natural Language Guided Universal Speech Synthesis · cs.CL · 2026-06-22 · unverdicted · none · ref 22 · internal anchor
Bagpiper-TTS uses natural language prompts and intent reasoning to derive rich captions that guide a single model for universal speech synthesis across classical TTS, multi-talker, singing, and role-play tasks.
AOR-Bench: Do Large Audio Language Models Over-Refuse Pseudo-Harmful Queries? · cs.SD · 2026-06-19 · unverdicted · none · ref 32 · internal anchor
AOR-Bench introduces the first benchmark for over-refusal in large audio language models, using 3,000 pseudo-harmful audio samples to evaluate 12 models across six families and finding widespread over-refusal.
PolySpeech-100: A Large-Scale Benchmark for Speech Understanding Across 100+ Languages and Dialects · cs.CL · 2026-05-31 · unverdicted · none · ref 33 · internal anchor
PolySpeech-100 is a new benchmark for native-level speech comprehension across 110 linguistic variants that evaluates 22 models and reports E2E advantages on dialects, robustness gaps on low-resource languages, and degradation from Chain-of-Thought prompting.
Unified Synthesis of Compositional Speech and Sound from Free-Form Text Prompts · cs.SD · 2026-05-27 · unverdicted · none · ref 3 · internal anchor
PlanAudio introduces a unified autoregressive LLM framework with semantic latent chain-of-thought for generating composite speech and sound audio from free-form text, plus a new benchmark.
Kinetic-Optimal Scheduling with Moment Correction for Metric-Induced Discrete Flow Matching in Zero-Shot Text-to-Speech · eess.AS · 2026-05-10 · unverdicted · none · ref 8 · internal anchor
GibbsTTS combines a training-free kinetic-optimal scheduler with finite-step moment correction in MI-DFM to deliver top naturalness and strong speaker similarity in zero-shot TTS.
VITA-QinYu: Expressive Spoken Language Model for Role-Playing and Singing · cs.CL · 2026-05-07 · unverdicted · none · ref 109 · internal anchor
VITA-QinYu is the first expressive end-to-end spoken language model supporting role-playing and singing alongside conversation, trained on 15.8K hours of data and outperforming prior models on expressiveness and conversational benchmarks.
Toward Fine-Grained Speech Inpainting Forensics: A Dataset, Method, and Metric for Multi-Region Tampering Localization · cs.SD · 2026-05-04 · unverdicted · none · ref 3 · internal anchor
A new dataset, iterative coarse-to-fine localization framework, and segment-level IoU F1 metric tackle the open problem of detecting multiple unknown word-level inpainted regions in speech.
SPG-Codec: Exploring the Role and Boundaries of Semantic Priors in Ultra-Low-Bitrate Neural Speech Coding · eess.AS · 2026-04-29 · unverdicted · none · ref 25 · internal anchor
Semantic priors from HuBERT and Whisper improve speech codec intelligibility up to 6 kbps but show diminishing returns beyond that, with a bitrate-aware regulation strategy balancing semantic consistency and naturalness.
AST: Adaptive, Seamless, and Training-Free Precise Speech Editing · cs.SD · 2026-04-17 · unverdicted · none · ref 19 · internal anchor
AST enables seamless speech editing through latent recomposition on pre-trained TTS models and adaptive weak fact guidance. It also introduces a new dataset and a WDTW metric, and claims a 70% WER reduction and better temporal consistency without training.
From Reactive to Proactive: Assessing the Proactivity of Voice Agents via ProVoice-Bench · cs.AI · 2026-04-16 · unverdicted · none · ref 31 · internal anchor
ProVoice-Bench is the first framework to evaluate proactive voice agents, revealing that state-of-the-art multimodal LLMs struggle with over-triggering and context-aware reasoning.
Knowing What to Stress: A Discourse-Conditioned Text-to-Speech Benchmark · cs.CL · 2026-04-12 · unverdicted · none · ref 13 · internal anchor
CAST benchmark shows language models infer correct word stress from discourse context but TTS systems frequently fail to produce it in speech.
SQuTR: A Robustness Benchmark for Spoken Query to Text Retrieval under Acoustic Noise · cs.IR · 2026-02-13 · unverdicted · none · ref 8 · internal anchor
SQuTR aggregates 37k queries from six text retrieval datasets, synthesizes speech from 200 speakers, adds 17 noise categories at varying SNR, and shows that even large retrieval models degrade sharply under extreme acoustic noise.
Preserving Speech-to-Text LLM Capabilities in Speech-to-Speech Generation · eess.AS · 2026-06-29 · unverdicted · none · ref 16 · internal anchor
PRIME-Speech adds low-latency speech output to frozen S2T LLMs by synchronizing a causal post-decoder with intermediate hidden states and using mixed conditioning plus turn-level KV-cache packing, preserving original S2T performance across translation, QA, and dialogue tasks.
HPRO: Hierarchical Progressive Reward Optimization via Preference Extraction for Emotional Text-to-Speech · eess.AS · 2026-06-26 · unverdicted · none · ref 32 · internal anchor
HPRO uses a differentiable HD-Emo codec to extract separate content and style tokens and progressively aligns frame-, word-, and sentence-level rewards to improve emotional expressiveness in TTS while preserving intelligibility.
ProsoCodec: Prosody-Oriented Speech Codec for Voice Conversion · eess.AS · 2026-06-20 · unverdicted · none · ref 11 · internal anchor
ProsoCodec models prosody as a conditional residual in a speech codec via text and speaker prefix conditioning, yielding improved prosody preservation and less timbre leakage in voice conversion experiments.
Bagpiper-Edit: Zero-Shot Open-Ended Audio Editing via Rich-Caption · cs.SD · 2026-06-19 · unverdicted · none · ref 35 · internal anchor
Bagpiper-Edit performs zero-shot open-ended audio editing by translating natural-language instructions into edited rich captions that guide generation anchored to the original audio.
Speaker Identity in Non-Verbal Vocalizations: Conditional Distillation and Mixture of Experts Approach · eess.AS · 2026-06-19 · unverdicted · none · ref 10 · internal anchor
An MoE-enhanced model with conditional distillation reduces speech-NVV EER from 38.93% to 22.66% and speech EER from 13.17% to 9.24% across 10 NVV types.
DeSRPA: Decoupled Speech Role-Playing Agent via Inference-Time Intervention · cs.SD · 2026-06-16 · unverdicted · none · ref 37 · internal anchor
DeSRPA introduces a dual-level control vector method for inference-time intervention on frozen backbones to improve personality consistency and speech naturalness in role-playing agents over end-to-end fine-tuned baselines.
Joycent: Diffusion-based Accent TTS without Accented Phone Prediction · cs.SD · 2026-06-15 · unverdicted · none · ref 31 · internal anchor
Joycent uses diffusion modeling and conditional layer normalization to synthesize accented speech from standard phones and references, claiming better accentedness and speaker preservation than two-stage baselines.
EmoInstruct-TTS: Dual-Path Instruction-Guided Emotional Speech Synthesis · cs.CL · 2026-06-08 · unverdicted · none · ref 27 · internal anchor
EmoInstruct-TTS uses Emotion2embed and an Instruction-Conditioned Emotion Flow Model (ICE-Flow) to generate acoustically grounded emotion representations from free-form instructions and integrate them into an LLM-based TTS pipeline.
TLDR: Compressing Audio Tokens for Efficient Autoregressive Text-to-Speech · cs.SD · 2026-06-08 · unverdicted · none · ref 10 · internal anchor
TLDR groups codec tokens into patches for patch-level autoregressive modeling in pretrained TTS systems, yielding 1.8x speedup and 75% KV-cache reduction at patch size 4.
dots.tts Technical Report · cs.SD · 2026-06-05 · unverdicted · none · ref 3 · internal anchor
dots.tts reports SOTA benchmark results on Seed-TTS-Eval and other tests via continuous latent-space autoregressive modeling with three listed innovations and code release.
Read What You Hear: Reference-Free Hypotheses Evaluation with Acoustic Discrepancy · eess.AS · 2026-06-03 · unverdicted · none · ref 30 · internal anchor
READ is a reference-free ASR hypothesis scorer that measures acoustic discrepancy via conditional likelihood from a pretrained auto-regressive TTS model and yields up to 20% relative error rate reduction when used for refinement.
Foley-Omni: A Unified Multimodal Generation Model from Task-Level Audio Synthesis to Complete Video Soundtrack Generation · cs.SD · 2026-06-02 · unverdicted · none · ref 12 · internal anchor
Foley-Omni extends isolated audio synthesis to joint generation of full video soundtracks across speech, effects, and music, with a new V2ST-Bench for evaluation showing competitive single-task results and gains in mixed-track consistency.
Benchmarking Speech-to-Speech Translation Models · cs.CL · 2026-06-02 · unverdicted · none · ref 21 · internal anchor
COMPASS is a new reproducible benchmarking framework for S2ST that deploys 46 metrics on 1248 configurations, shows single-metric rankings mislead, reduces to 10 metrics per direction, and finds domain-specific metrics better match human judgments than standalone MOS predictors.
LaSR: Context-Aware Speech Recognition via Latent Reasoning · cs.CL · 2026-05-30 · unverdicted · none · ref 22 · internal anchor
LaSR improves context-aware terminology recognition in speech LLMs by aligning latent CoT supervision on acoustic regions and introducing latent reasoning periods, shown on a new academic corpus to outperform standard fine-tuning without added latency.
SwanVoice: Expressive Long-Form Zero-Shot Speech Synthesis for Both Monologue and Dialogue · eess.AS · 2026-05-29 · unverdicted · none · ref 11 · internal anchor
SwanVoice is a zero-shot TTS system for 1-4 speakers that reports higher richness and hierarchy scores than open-source baselines on monologue and dialogue tasks via mixed training and DiffusionNFT post-training.
Hidden in Plain Tokens: Simply Robust, Gradient-Free Watermark for Synthetic Audio · cs.LG · 2026-05-25 · unverdicted · none · ref 8 · internal anchor
A training-free audio watermarking method reduces the vocabulary via community detection, boosting detection robustness by orders of magnitude while resisting audio modifications.
RobustSpeechFlow: Learning Robust Text-to-Speech Trajectories via Augmentation-based Contrastive Flow Matching · cs.SD · 2026-05-21 · unverdicted · none · ref 32 · internal anchor
RobustSpeechFlow improves TTS alignment robustness by extending contrastive flow matching with length-preserving repeat and skip latent augmentations, lowering WER from 1.44 to 1.38 on Seed-TTS-eval and CER on ZERO500.
SemaVoice: Semantic-Aware Continuous Autoregressive Speech Synthesis · eess.AS · 2026-05-16 · unverdicted · none · ref 9 · internal anchor
SemaVoice adds SFM-guided alignment to refine continuous speech representations in autoregressive TTS, reporting 1.71% English WER on Seed-TTS and competitiveness with open-source SOTA.
From Flat Language Labels to Typological Priors: Structured Language Conditioning for Multilingual Speech-to-Speech Translation · cs.CL · 2026-05-15 · unverdicted · none · ref 41 · internal anchor
S2ST-Omni 2 uses typology-informed hierarchical encoding, gated Dual-CTC, and typology-aware prompting to improve multilingual S2ST over flat-label baselines on CVSS-C, with gains in low-data regimes.
TTS-PRISM: A Perceptual Reasoning and Interpretable Speech Model for Fine-Grained Diagnosis · cs.CL · 2026-04-24 · unverdicted · none · ref 7 · internal anchor
TTS-PRISM defines a 12-dimensional perceptual schema, builds a targeted diagnostic dataset via adversarial synthesis and expert labels, and tunes an end-to-end model that outperforms generalist LLMs in human alignment on a 1,600-sample Mandarin test set while profiling six TTS paradigms.
UAF: A Unified Audio Front-end LLM for Full-Duplex Speech Interaction · cs.AI · 2026-04-21 · unverdicted · none · ref 10 · internal anchor
UAF is the first unified audio front-end LLM that turns multiple front-end tasks into one sequence prediction model processing streaming audio chunks and reference prompts to output semantic and control tokens for full-duplex interaction.
MoVE: Translating Laughter and Tears via Mixture of Vocalization Experts in Speech-to-Speech Translation · cs.CL · 2026-04-19 · unverdicted · none · ref 42 · internal anchor
MoVE uses specialized LoRA expert adapters and a soft router to translate non-verbal vocalizations in S2ST, reproducing them in 76% of cases versus at most 14% for baselines while scoring highest on naturalness and emotional fidelity.
Audio2Tool: Speak, Call, Act -- A Dataset for Benchmarking Speech Tool Use · cs.SD · 2026-04-17 · unverdicted · none · ref 30 · internal anchor
Audio2Tool is a new benchmark dataset that shows speech models perform well on simple commands but degrade sharply on compositional tasks and realistic acoustic noise.
ASPIRin: Action Space Projection for Interactivity-Optimized Reinforcement Learning in Full-Duplex Speech Language Models · cs.CL · 2026-04-11 · unverdicted · none · ref 26 · internal anchor
ASPIRin decouples speaking timing from token content via binary action space projection and applies GRPO with rule-based rewards to optimize interactivity in SLMs without semantic collapse or repetition.
OmniVoice: Towards Omnilingual Zero-Shot Text-to-Speech with Diffusion Language Models · cs.CL · 2026-04-01 · unverdicted · none · ref 33 · internal anchor
OmniVoice introduces a diffusion language model-style non-autoregressive TTS system that directly maps text to multi-codebook acoustic tokens, scaling zero-shot synthesis to over 600 languages with SOTA results on multilingual benchmarks using 581k hours of open data.
FoleyDirector: Fine-Grained Temporal Steering for Video-to-Audio Generation via Structured Scripts · cs.SD · 2026-03-20 · unverdicted · none · ref 8 · internal anchor
FoleyDirector introduces structured temporal scripts and a fusion module to enable precise timing control in DiT-based video-to-audio generation while preserving audio fidelity.
AMR: Adaptive Modality Routing for Multimodal Polyglot Speaker Identification · cs.LG · 2026-06-28 · unverdicted · none · ref 11 · internal anchor
AMR dynamically routes audio (W2V-BERT 2.0) and face (IResNet-18) embeddings via adapters and a KL-supervised router, reaching 99.07% average accuracy on POLY-SIM 2026 protocols and beating the FOP baseline by 32.73%.
FlowTTS-GRPO: Online Reinforcement Learning with Multi-Objective Reward Optimization for Flow-Matching Based Text-to-Speech · eess.AS · 2026-06-22 · unverdicted · none · ref 12 · internal anchor
FlowTTS-GRPO applies online RL with weighted multi-objective rewards to flow-matching TTS models via ODE-to-SDE conversion, reporting gains in speaker similarity and perceptual quality on CosyVoice 3.0 and F5-TTS.
End-to-End Training for Discrete Token LLM based TTS System · cs.SD · 2026-06-08 · unverdicted · none · ref 2 · internal anchor
An end-to-end optimization framework jointly trains the speech tokenizer, LLM, FM model, and reward model for discrete-token TTS, reporting new SOTA WER of 0.78% and 1.56% on Seed-TTS-Eval with 0.6B LLM and 0.5B FM.
ImmersiveTTS: Environment-Aware Text-to-Speech with Multimodal Diffusion Transformer and Domain-Specific Representation Alignment · eess.AS · 2026-05-29 · unverdicted · none · ref 14 · internal anchor
ImmersiveTTS proposes an environment-aware TTS system that integrates speech with environmental audio via multimodal diffusion transformer, joint attention, and domain-specific representation alignment, claiming superior naturalness and fidelity.
Raon-OpenTTS: Open Models and Data for Robust Text-to-Speech · eess.AS · 2026-05-20 · unverdicted · none · ref 6 · internal anchor
Raon-OpenTTS provides an open 510K-hour curated speech dataset and DiT-based TTS models up to 1B parameters that achieve competitive WER and speaker similarity on benchmarks versus closed models trained on millions of hours.
Omni-Customizer: End-to-End MultiModal Customization for Joint Audio-Video Generation · cs.CV · 2026-05-17 · unverdicted · none · ref 13 · internal anchor
Omni-Customizer proposes an end-to-end framework using Omni-Context Fusion, Masked TTS Cross-Attention, Semantic-Anchored Multimodal RoPE, and specialized training curricula to achieve precise multimodal identity binding in joint audio-video generation.
AgentSteerTTS: A Multi-Agent Closed-Loop Framework for Composite-Instruction Text-to-Speech · cs.CV · 2026-05-14 · unverdicted · none · ref 100 · internal anchor
AgentSteerTTS proposes a multi-agent framework with adversarial disentanglement, dual-stream anchoring via acoustic prototypes, and fast-slow feedback to achieve intent-faithful expressive TTS for composite instructions.
Enhancing Flow Matching with A Unified Guidance Framework for Efficient and Robust Speech Synthesis · cs.SD · 2026-07-01 · unverdicted · none · ref 12 · internal anchor
A unified guidance framework for Flow Matching speech synthesis achieves nearly 3x faster inference and improved speaker similarity by combining heterogeneous data augmentation with intrinsic model guidance to eliminate CFG overhead.
