Vall-e r: Robust and efficient zero-shot text-to-speech synthesis via monotonic alignment

Bing Han, Long Zhou, Shujie Liu, Sanyuan Chen, Lingwei Meng, Yanming Qian, Yanqing Liu, Sheng Zhao, Jinyu Li, Furu Wei · 2024 · arXiv 2406.07855

7 Pith papers cite this work. Polarity classification is still indexing.

7 Pith papers citing it

read on arXiv browse 7 citing papers

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

DiFlowDubber: Discrete Flow Matching for Automated Video Dubbing via Cross-Modal Alignment and Synchronization

cs.CV · 2026-03-15 · unverdicted · novelty 7.0

DiFlowDubber is the first video dubbing system using a discrete flow matching backbone with two-stage training that pre-trains a zero-shot TTS then adapts it via cross-modal alignment to produce content-consistent, lip-synchronized speech.

Hierarchical Decoding for Discrete Speech Synthesis with Multi-Resolution Spoof Detection

cs.SD · 2026-03-05 · unverdicted · novelty 7.0

MSpoof-TTS improves zero-shot discrete speech synthesis by integrating multi-resolution token-based spoof detection into a hierarchical decoding process that prunes low-quality candidates.

SemaVoice: Semantic-Aware Continuous Autoregressive Speech Synthesis

eess.AS · 2026-05-16 · unverdicted · novelty 6.0

SemaVoice adds SFM-guided alignment to refine continuous speech representations in autoregressive TTS, reporting 1.71% English WER on Seed-TTS and competitiveness with open-source SOTA.

CoMelSinger: Discrete Token-Based Zero-Shot Singing Synthesis With Structured Melody Control and Guidance

cs.SD · 2025-09-24 · unverdicted · novelty 6.0

CoMelSinger introduces a discrete token-based zero-shot SVS framework on MaskGCT with coarse-to-fine contrastive learning and an SVT module to improve melody control and reduce prosody leakage.

CosyVoice 3: Towards In-the-wild Speech Generation via Scaling-up and Post-training

cs.SD · 2025-05-23 · unverdicted · novelty 6.0

CosyVoice 3 achieves better content consistency, speaker similarity, and prosody naturalness in zero-shot multilingual speech synthesis by scaling data to one million hours, model size to 1.5 billion parameters, and introducing a supervised multi-task speech tokenizer plus a differentiable reward模型.

CosyVoice 2: Scalable Streaming Speech Synthesis with Large Language Models

cs.SD · 2024-12-13 · unverdicted · novelty 5.0

CosyVoice 2 delivers human-parity naturalness and near-lossless streaming speech synthesis by combining finite-scalar quantization, a streamlined pre-trained LLM, and chunk-aware causal flow matching on large multilingual data.

F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching

eess.AS · 2024-10-09 · unverdicted · novelty 5.0

F5-TTS generates natural speech from text via flow matching on DiT with simple text padding, ConvNeXt refinement, and sway sampling, trained on 100K hours multilingual data.

citing papers explorer

Showing 7 of 7 citing papers.

DiFlowDubber: Discrete Flow Matching for Automated Video Dubbing via Cross-Modal Alignment and Synchronization cs.CV · 2026-03-15 · unverdicted · none · ref 20
DiFlowDubber is the first video dubbing system using a discrete flow matching backbone with two-stage training that pre-trains a zero-shot TTS then adapts it via cross-modal alignment to produce content-consistent, lip-synchronized speech.
Hierarchical Decoding for Discrete Speech Synthesis with Multi-Resolution Spoof Detection cs.SD · 2026-03-05 · unverdicted · none · ref 22
MSpoof-TTS improves zero-shot discrete speech synthesis by integrating multi-resolution token-based spoof detection into a hierarchical decoding process that prunes low-quality candidates.
SemaVoice: Semantic-Aware Continuous Autoregressive Speech Synthesis eess.AS · 2026-05-16 · unverdicted · none · ref 38
SemaVoice adds SFM-guided alignment to refine continuous speech representations in autoregressive TTS, reporting 1.71% English WER on Seed-TTS and competitiveness with open-source SOTA.
CoMelSinger: Discrete Token-Based Zero-Shot Singing Synthesis With Structured Melody Control and Guidance cs.SD · 2025-09-24 · unverdicted · none · ref 46
CoMelSinger introduces a discrete token-based zero-shot SVS framework on MaskGCT with coarse-to-fine contrastive learning and an SVT module to improve melody control and reduce prosody leakage.
CosyVoice 3: Towards In-the-wild Speech Generation via Scaling-up and Post-training cs.SD · 2025-05-23 · unverdicted · none · ref 14
CosyVoice 3 achieves better content consistency, speaker similarity, and prosody naturalness in zero-shot multilingual speech synthesis by scaling data to one million hours, model size to 1.5 billion parameters, and introducing a supervised multi-task speech tokenizer plus a differentiable reward模型.
CosyVoice 2: Scalable Streaming Speech Synthesis with Large Language Models cs.SD · 2024-12-13 · unverdicted · none · ref 17
CosyVoice 2 delivers human-parity naturalness and near-lossless streaming speech synthesis by combining finite-scalar quantization, a streamlined pre-trained LLM, and chunk-aware causal flow matching on large multilingual data.
F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching eess.AS · 2024-10-09 · unverdicted · none · ref 100
F5-TTS generates natural speech from text via flow matching on DiT with simple text padding, ConvNeXt refinement, and sway sampling, trained on 100K hours multilingual data.

Vall-e r: Robust and efficient zero-shot text-to-speech synthesis via monotonic alignment

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer