OpenVoice: Versatile Instant Voice Cloning
9 Pith papers cite this work.
citing papers explorer
- Poly-SVC: Polyphony-Aware Singing Voice Conversion with Harmonic Modeling
  Poly-SVC converts singing voices from polyphonic recordings while keeping melody, lyrics, and harmonies by combining CQT-based pitch extraction with a conditional flow matching diffusion decoder.
- V.O.I.C.E (Voice, Ownership, Identity, Control, Expression): Risk Taxonomy of Synthetic Voice Generation From Empirical Data
  V.O.I.C.E is a new taxonomy that organizes synthetic voice risks into five categories and shows how they interact with exposure, visibility, and legal context using empirical incident data.
- X-VC: Zero-shot Streaming Voice Conversion in Codec Space
  X-VC achieves zero-shot streaming voice conversion via one-step codec-space conversion with a dual-conditioning acoustic converter and role-assignment training on generated paired data.
- JUST-DUB-IT: Video Dubbing via Joint Audio-Visual Diffusion
  JUST-DUB-IT adapts a joint audio-visual diffusion model via LoRA to generate high-quality dubbed videos with translated audio and lip-synced facial motion.
- EchoFake: A Replay-Aware Dataset for Practical Speech Deepfake Detection
  EchoFake is a new replay-aware dataset combining zero-shot TTS deepfakes and physical replay recordings to improve generalization of speech deepfake detection models over existing lab-focused datasets.
- MimicLM: Zero-Shot Voice Imitation through Autoregressive Modeling of Pseudo-Parallel Speech Corpora
  MimicLM achieves better naturalness in zero-shot voice imitation by autoregressively modeling pseudo-parallel data with synthetic sources and real targets, plus interleaved text-audio guidance and preference alignment.
- Elderly-Contextual Data Augmentation via Speech Synthesis for Elderly ASR
  Combining LLM-based elderly-contextual paraphrasing with TTS synthesis using elderly speakers reduces word error rates in elderly ASR by up to 58% over standard Whisper baselines on English and Korean datasets.
- CosyVoice 2: Scalable Streaming Speech Synthesis with Large Language Models
  CosyVoice 2 delivers human-parity naturalness and near-lossless streaming speech synthesis by combining finite-scalar quantization, a streamlined pre-trained LLM, and chunk-aware causal flow matching on large multilingual data.
- Talking Slide Avatars: Open-Source Multimodal Communication Approach for Teaching