Vevo2: A unified and controllable framework for speech and singing voice generation
6 Pith papers cite this work. Polarity classification is still indexing.
citation summary
years: 2026 (6)
verdicts: UNVERDICTED (6)
roles: background (1)
polarities: background (1)
citing papers explorer
- MINT-Bench: A Comprehensive Multilingual Benchmark for Instruction-Following Text-to-Speech
  MINT-Bench is a new benchmark using hierarchical taxonomy, multi-stage data pipeline, and hybrid evaluation to assess instruction-following TTS systems, revealing major gaps in compositional and paralinguistic controls.
- YingMusic-Singer-Plus: Controllable Singing Voice Synthesis with Flexible Lyric Manipulation and Annotation-free Melody Guidance
  YingMusic-Singer-Plus is a diffusion model for singing voice synthesis that preserves melody from a reference clip while allowing flexible lyric changes without manual alignment, outperforming Vevo2 and introducing the LyricEditBench benchmark.
- UniVocal: Unified Speech-Singing Code-Switching Synthesis
  UniVocal presents a text-context-only framework for speech-singing code-switching synthesis via two-stage curriculum learning and a synthetic data pipeline, claiming SOTA on a new benchmark.
- Vibrato Expression Control for Singing Voice Conversion with Improving Independent Control
  VibE-SVC2 extends prior singing voice conversion work with new modules for independent pitch-style and timbre-style control, claiming better performance and finer controllability than existing methods.
- UniVoice: A Unified Model for Speech and Singing Voice Generation
  UniVoice is a conditional flow matching model with a Diffusion Transformer backbone that unifies TTS and SVS via modality-specific encoders and a null melody token for speech, achieving 5.26% speech PER and 16.22% singing PER.
- ImmersiveTTS: Environment-Aware Text-to-Speech with Multimodal Diffusion Transformer and Domain-Specific Representation Alignment
  ImmersiveTTS proposes an environment-aware TTS system that integrates speech with environmental audio via multimodal diffusion transformer, joint attention, and domain-specific representation alignment, claiming superior naturalness and fidelity.