Vevo2: A unified and controllable frame- work for speech and singing voice generation

· 2025 · arXiv 2508.16332

6 Pith papers cite this work. Polarity classification is still indexing.

6 Pith papers citing it

read on arXiv browse 6 citing papers

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

MINT-Bench: A Comprehensive Multilingual Benchmark for Instruction-Following Text-to-Speech

eess.AS · 2026-04-20 · unverdicted · novelty 7.0

MINT-Bench is a new benchmark using hierarchical taxonomy, multi-stage data pipeline, and hybrid evaluation to assess instruction-following TTS systems, revealing major gaps in compositional and paralinguistic controls.

YingMusic-Singer-Plus: Controllable Singing Voice Synthesis with Flexible Lyric Manipulation and Annotation-free Melody Guidance

eess.AS · 2026-03-25 · unverdicted · novelty 7.0

YingMusic-Singer-Plus is a diffusion model for singing voice synthesis that preserves melody from a reference clip while allowing flexible lyric changes without manual alignment, outperforming Vevo2 and introducing the LyricEditBench benchmark.

UniVocal: Unified Speech-Singing Code-Switching Synthesis

cs.SD · 2026-06-01 · unverdicted · novelty 6.0

UniVocal presents a text-context-only framework for speech-singing code-switching synthesis via two-stage curriculum learning and a synthetic data pipeline, claiming SOTA on a new benchmark.

Vibrato Expression Control for Singing Voice Conversion with Improving Independent Control

cs.SD · 2026-06-15 · unverdicted · novelty 5.0

VibE-SVC2 extends prior singing voice conversion work with new modules for independent pitch-style and timbre-style control, claiming better performance and finer controllability than existing methods.

UniVoice: A Unified Model for Speech and Singing Voice Generation

cs.SD · 2026-06-04 · unverdicted · novelty 5.0

UniVoice is a conditional flow matching model with a Diffusion Transformer backbone that unifies TTS and SVS via modality-specific encoders and a null melody token for speech, achieving 5.26% speech PER and 16.22% singing PER.

ImmersiveTTS: Environment-Aware Text-to-Speech with Multimodal Diffusion Transformer and Domain-Specific Representation Alignment

eess.AS · 2026-05-29 · unverdicted · novelty 5.0

ImmersiveTTS proposes an environment-aware TTS system that integrates speech with environmental audio via multimodal diffusion transformer, joint attention, and domain-specific representation alignment, claiming superior naturalness and fidelity.

citing papers explorer

Showing 6 of 6 citing papers after filters.

MINT-Bench: A Comprehensive Multilingual Benchmark for Instruction-Following Text-to-Speech eess.AS · 2026-04-20 · unverdicted · none · ref 45
MINT-Bench is a new benchmark using hierarchical taxonomy, multi-stage data pipeline, and hybrid evaluation to assess instruction-following TTS systems, revealing major gaps in compositional and paralinguistic controls.
YingMusic-Singer-Plus: Controllable Singing Voice Synthesis with Flexible Lyric Manipulation and Annotation-free Melody Guidance eess.AS · 2026-03-25 · unverdicted · none · ref 15
YingMusic-Singer-Plus is a diffusion model for singing voice synthesis that preserves melody from a reference clip while allowing flexible lyric changes without manual alignment, outperforming Vevo2 and introducing the LyricEditBench benchmark.
UniVocal: Unified Speech-Singing Code-Switching Synthesis cs.SD · 2026-06-01 · unverdicted · none · ref 71
UniVocal presents a text-context-only framework for speech-singing code-switching synthesis via two-stage curriculum learning and a synthetic data pipeline, claiming SOTA on a new benchmark.
Vibrato Expression Control for Singing Voice Conversion with Improving Independent Control cs.SD · 2026-06-15 · unverdicted · none · ref 25
VibE-SVC2 extends prior singing voice conversion work with new modules for independent pitch-style and timbre-style control, claiming better performance and finer controllability than existing methods.
UniVoice: A Unified Model for Speech and Singing Voice Generation cs.SD · 2026-06-04 · unverdicted · none · ref 18
UniVoice is a conditional flow matching model with a Diffusion Transformer backbone that unifies TTS and SVS via modality-specific encoders and a null melody token for speech, achieving 5.26% speech PER and 16.22% singing PER.
ImmersiveTTS: Environment-Aware Text-to-Speech with Multimodal Diffusion Transformer and Domain-Specific Representation Alignment eess.AS · 2026-05-29 · unverdicted · none · ref 72
ImmersiveTTS proposes an environment-aware TTS system that integrates speech with environmental audio via multimodal diffusion transformer, joint attention, and domain-specific representation alignment, claiming superior naturalness and fidelity.

Vevo2: A unified and controllable frame- work for speech and singing voice generation

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer