An Extensive Analysis of the Singing Voice Conversion Challenge 2025 Evaluation Results

Jiatong Shi; Lester Phillip Violeta; Tomoki Toda; Wen-Chin Huang; Xueyao Zhang; Yusuke Yasuda; Zhizheng Wu

arxiv: 2509.15629 · v2 · pith:CW5ETQR4new · submitted 2025-09-19 · 💻 cs.SD · eess.AS

Lester Phillip Violeta, Xueyao Zhang, Jiatong Shi, Yusuke Yasuda, Wen-Chin Huang, Zhizheng Wu, Tomoki Toda

classification 💻 cs.SD eess.AS

keywords challenge · singing · conversion · scores · singer · style · systems · test

We present a thorough analysis of the findings of the latest iteration of the Singing Voice Conversion Challenge, a scientific event that aims to compare and understand different voice conversion systems in a controlled environment. Whereas previous iterations focused solely on converting singer identity, this year's challenge also addressed converting the singing style of the singer. To create a controlled environment and enable thorough evaluation, we developed a new challenge database, introduced two tasks, open-sourced baselines, and conducted large-scale crowd-sourced listening tests and objective evaluations. The challenge ran for two months, and in total we evaluated 33 different systems. The results of the large-scale crowd-sourced listening test showed that the top systems achieved singer identity scores comparable to those of ground truth samples. However, modeling the singing style, and consequently achieving high naturalness, remains a challenge in this task, primarily due to the difficulty of modeling dynamic information in breathy, glissando, and vibrato singing styles. Further analyses also discuss the limitations of both the traditional similarity test and the dynamic preference test in evaluating singing style similarity. Moreover, Spearman's rank correlation coefficients show that dependent objective metrics such as chroma-alignment and non-match metrics such as speaker embeddings are the most correlated with subjective scores, but they are still not at a level where they could be considered a true replacement for subjective scores.
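
The abstract compares objective metrics against subjective listening scores using Spearman's rank correlation coefficient. The following is a minimal sketch of that statistic, not the authors' evaluation code: the function names are hypothetical, and the per-system values are made up purely for illustration and do not come from the challenge. It ranks both lists (averaging ranks for ties) and takes the Pearson correlation of the ranks.

```python
def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman_rho(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("inputs must have equal length of at least 2")
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical per-system values, invented for illustration only.
objective_metric = [0.62, 0.71, 0.55, 0.80, 0.68]
subjective_score = [3.1, 3.6, 2.9, 3.4, 3.8]

rho = spearman_rho(objective_metric, subjective_score)
print(round(rho, 3))
```

For these invented values the coefficient is 0.6: the two rankings mostly agree, but the system ranked first by the metric is not the one listeners ranked first. This is the kind of gap that keeps an objective metric from standing in for a listening test.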

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Escaping the Linearity Trap: Manifold Detours for Black-Box Adversarial Attacks on Singing Audio Deepfake Detection
cs.CR 2026-05 unverdicted novelty 7.0

MARS is a transfer-based black-box attack that uses bi-level optimization on semantic and artifact anchors to escape the linearity trap and improve attack success rates on SSL-SVDD by up to 36%.

Vibrato Expression Control for Singing Voice Conversion with Improving Independent Control
cs.SD 2026-06 unverdicted novelty 5.0

VibE-SVC2 extends prior singing voice conversion work with new modules for independent pitch-style and timbre-style control, claiming better performance and finer controllability than existing methods.

Controllable Singing Style Conversion with Boundary-Aware Information Bottleneck
cs.SD 2026-04 unverdicted novelty 5.0

A singing voice conversion system with boundary-aware information bottleneck and high-frequency augmentation achieves the best naturalness in SVCC2025 subjective tests while using less extra data than competitors.