XTTS: a massively multilingual zero-shot text-to-speech model

Edresson Casanova, Kelly Davis, Eren Gölge, Görkem Göknar, Iulian Gulea, Logan Hart, Aya Aljafari, Joshua Meyer, Reuben Morais, Samuel Olayemi, et al. · 2024 · arXiv 2406.04904

5 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

EchoFake: A Replay-Aware Dataset for Practical Speech Deepfake Detection

eess.AS · 2025-10-22 · unverdicted · novelty 7.0

EchoFake is a new replay-aware dataset combining zero-shot TTS deepfakes and physical replay recordings to improve generalization of speech deepfake detection models over existing lab-focused datasets.

X-Voice: Enabling Everyone to Speak 30 Languages via Zero-Shot Cross-Lingual Voice Cloning

cs.SD · 2026-05-07 · unverdicted · novelty 6.0 · 2 refs

X-Voice achieves zero-shot cross-lingual voice cloning across 30 languages by using IPA as a unified phonetic representation and a two-stage training process that first generates its own audio prompts then fine-tunes without text.

ProSDD: Learning Prosodic Representations for Speech Deepfake Detection against Expressive and Emotional Attacks

eess.AS · 2026-04-14 · unverdicted · novelty 6.0

ProSDD learns speaker-conditioned prosodic variation from real speech via supervised masked prediction and jointly optimizes it with spoof detection, cutting EER substantially on ASVspoof 2024 and emotional datasets.

PS-TTS: Phonetic Synchronization in Text-to-Speech for Achieving Natural Automated Dubbing

eess.AS · 2026-04-10 · unverdicted · novelty 6.0

PS-TTS and PS-Comet TTS combine isochrony, obtained through language-model paraphrasing, with phonetic synchronization based on DTW over vowel distances, and achieve better lip-sync and semantic preservation in automated dubbing than standard TTS or voice actors on the tested language pairs.

Multimodal Large Language Model-Enabled Video Translation: A Role-Oriented Survey

cs.CV · 2026-04-13

citing papers explorer

Showing 5 of 5 citing papers.

EchoFake: A Replay-Aware Dataset for Practical Speech Deepfake Detection · eess.AS · 2025-10-22 · unverdicted · none · ref 18
X-Voice: Enabling Everyone to Speak 30 Languages via Zero-Shot Cross-Lingual Voice Cloning · cs.SD · 2026-05-07 · unverdicted · none · ref 21 · 2 links
ProSDD: Learning Prosodic Representations for Speech Deepfake Detection against Expressive and Emotional Attacks · eess.AS · 2026-04-14 · unverdicted · none · ref 54
PS-TTS: Phonetic Synchronization in Text-to-Speech for Achieving Natural Automated Dubbing · eess.AS · 2026-04-10 · unverdicted · none · ref 31
Multimodal Large Language Model-Enabled Video Translation: A Role-Oriented Survey · cs.CV · 2026-04-13 · unreviewed · ref 83
