DiFlowDubber is the first video dubbing system using a discrete flow matching backbone with two-stage training that pre-trains a zero-shot TTS then adapts it via cross-modal alignment to produce content-consistent, lip-synchronized speech.
Vall-e r: Robust and efficient zero-shot text-to-speech synthesis via monotonic alignment
7 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
verdicts
UNVERDICTED 7roles
background 1polarities
background 1representative citing papers
MSpoof-TTS improves zero-shot discrete speech synthesis by integrating multi-resolution token-based spoof detection into a hierarchical decoding process that prunes low-quality candidates.
SemaVoice adds SFM-guided alignment to refine continuous speech representations in autoregressive TTS, reporting 1.71% English WER on Seed-TTS and competitiveness with open-source SOTA.
CoMelSinger introduces a discrete token-based zero-shot SVS framework on MaskGCT with coarse-to-fine contrastive learning and an SVT module to improve melody control and reduce prosody leakage.
CosyVoice 3 achieves better content consistency, speaker similarity, and prosody naturalness in zero-shot multilingual speech synthesis by scaling data to one million hours, model size to 1.5 billion parameters, and introducing a supervised multi-task speech tokenizer plus a differentiable reward模型.
CosyVoice 2 delivers human-parity naturalness and near-lossless streaming speech synthesis by combining finite-scalar quantization, a streamlined pre-trained LLM, and chunk-aware causal flow matching on large multilingual data.
F5-TTS generates natural speech from text via flow matching on DiT with simple text padding, ConvNeXt refinement, and sway sampling, trained on 100K hours multilingual data.
citing papers explorer
-
DiFlowDubber: Discrete Flow Matching for Automated Video Dubbing via Cross-Modal Alignment and Synchronization
DiFlowDubber is the first video dubbing system using a discrete flow matching backbone with two-stage training that pre-trains a zero-shot TTS then adapts it via cross-modal alignment to produce content-consistent, lip-synchronized speech.
-
Hierarchical Decoding for Discrete Speech Synthesis with Multi-Resolution Spoof Detection
MSpoof-TTS improves zero-shot discrete speech synthesis by integrating multi-resolution token-based spoof detection into a hierarchical decoding process that prunes low-quality candidates.
-
SemaVoice: Semantic-Aware Continuous Autoregressive Speech Synthesis
SemaVoice adds SFM-guided alignment to refine continuous speech representations in autoregressive TTS, reporting 1.71% English WER on Seed-TTS and competitiveness with open-source SOTA.
-
CoMelSinger: Discrete Token-Based Zero-Shot Singing Synthesis With Structured Melody Control and Guidance
CoMelSinger introduces a discrete token-based zero-shot SVS framework on MaskGCT with coarse-to-fine contrastive learning and an SVT module to improve melody control and reduce prosody leakage.
-
CosyVoice 3: Towards In-the-wild Speech Generation via Scaling-up and Post-training
CosyVoice 3 achieves better content consistency, speaker similarity, and prosody naturalness in zero-shot multilingual speech synthesis by scaling data to one million hours, model size to 1.5 billion parameters, and introducing a supervised multi-task speech tokenizer plus a differentiable reward模型.
-
CosyVoice 2: Scalable Streaming Speech Synthesis with Large Language Models
CosyVoice 2 delivers human-parity naturalness and near-lossless streaming speech synthesis by combining finite-scalar quantization, a streamlined pre-trained LLM, and chunk-aware causal flow matching on large multilingual data.
-
F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching
F5-TTS generates natural speech from text via flow matching on DiT with simple text padding, ConvNeXt refinement, and sway sampling, trained on 100K hours multilingual data.