EchoFake is a new replay-aware dataset combining zero-shot TTS deepfakes and physical replay recordings to improve generalization of speech deepfake detection models over existing lab-focused datasets.
Xtts: a massively mul- tilingual zero-shot text-to-speech model
5 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
roles
background 1polarities
background 1representative citing papers
X-Voice achieves zero-shot cross-lingual voice cloning across 30 languages by using IPA as a unified phonetic representation and a two-stage training process that first generates its own audio prompts then fine-tunes without text.
ProSDD learns speaker-conditioned prosodic variation from real speech via supervised masked prediction and jointly optimizes it with spoof detection, cutting EER substantially on ASVspoof 2024 and emotional datasets.
PS-TTS and PS-Comet TTS use isochrony via language model paraphrasing plus phonetic synchronization with DTW on vowel distances to achieve better lip-sync and semantic preservation in automated dubbing than standard TTS or voice actors on tested language pairs.
citing papers explorer
-
EchoFake: A Replay-Aware Dataset for Practical Speech Deepfake Detection
EchoFake is a new replay-aware dataset combining zero-shot TTS deepfakes and physical replay recordings to improve generalization of speech deepfake detection models over existing lab-focused datasets.
-
X-Voice: Enabling Everyone to Speak 30 Languages via Zero-Shot Cross-Lingual Voice Cloning
X-Voice achieves zero-shot cross-lingual voice cloning across 30 languages by using IPA as a unified phonetic representation and a two-stage training process that first generates its own audio prompts then fine-tunes without text.
-
ProSDD: Learning Prosodic Representations for Speech Deepfake Detection against Expressive and Emotional Attacks
ProSDD learns speaker-conditioned prosodic variation from real speech via supervised masked prediction and jointly optimizes it with spoof detection, cutting EER substantially on ASVspoof 2024 and emotional datasets.
-
PS-TTS: Phonetic Synchronization in Text-to-Speech for Achieving Natural Automated Dubbing
PS-TTS and PS-Comet TTS use isochrony via language model paraphrasing plus phonetic synchronization with DTW on vowel distances to achieve better lip-sync and semantic preservation in automated dubbing than standard TTS or voice actors on tested language pairs.
- Multimodal Large Language Model-Enabled Video Translation: A Role-Oriented Survey