SingFox: A Multi-Lingual Singfake Detection Corpus

Arth J. Shah; Devanshi K. Trivedi; Hemant A. Patil; Himanshi U. Borad

arxiv: 2606.18985 · v1 · pith:U4D52CRXnew · submitted 2026-06-17 · 📡 eess.AS

SingFox: A Multi-Lingual Singfake Detection Corpus

Arth J. Shah , Devanshi K. Trivedi , Himanshi U. Borad , Hemant A. Patil This is my paper

Pith reviewed 2026-06-26 19:27 UTC · model grok-4.3

classification 📡 eess.AS

keywords singing deepfake detectionmulti-lingual datasetsingfakesource verificationaudio datasetdeepfake benchmarkreproducibility

0 comments

The pith

SingFox supplies a 113k-clip multi-lingual corpus in six tracks to benchmark singing deepfake detection and source verification.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces SingFox, a dataset of more than 113,802 audio clips spanning 20 languages, 126 hours, and 1,150 singers. It splits the material into six tracks that vary by global and Indian languages, music genres, and methods of generating fakes. The tracks are meant to test how reliably detection models work under different real-world-like conditions. The work also supports a source-verification task for model explainability and reports a peak cross-dataset accuracy of 77.84 percent. All code and data resources are released publicly to promote reproducible research.

Core claim

SingFox is a comprehensive dataset encompassing 113,802 audio clips across 20 languages and 1,150 singers, organized into six tracks that target specific novelties in language diversity, genre-specific music, and alternative fake generation methods to evaluate model robustness in singing deepfake detection and source verification.

What carries the argument

The SingFox dataset divided into six tracks (T1-T6), each targeting a unique form of novelty to emulate real-world scenarios for detection and source verification.

If this is right

Detection models can be evaluated for performance across global and Indian language sets.
Robustness can be measured on genre-specific music and on alternative fake-generation techniques.
The source-verification track enables studies of model explainability alongside detection accuracy.
Public release of the full dataset and code supports direct reproduction of the 77.84 percent cross-dataset result.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Models that succeed on SingFox may still require additional data from outside the six tracks to handle entirely new singing styles.
The multi-lingual coverage could expose language-specific weaknesses in existing deepfake detectors that were trained mostly on English material.
Researchers could extend the tracks with new genres or languages while reusing the same evaluation protocol.
The source-verification task opens a route to studying which acoustic features models rely on when they label a clip as fake.

Load-bearing premise

The six tracks sufficiently emulate real-world scenarios to assess model robustness.

What would settle it

A controlled test in which models trained and evaluated on SingFox show low accuracy on newly recorded singing deepfakes drawn from languages or genres outside the six tracks would indicate the benchmark does not capture the claimed robustness.

Figures

Figures reproduced from arXiv: 2606.18985 by Arth J. Shah, Devanshi K. Trivedi, Hemant A. Patil, Himanshi U. Borad.

**Figure 2.** Figure 2: End-to-end SingFox pipeline illustrating data collection, preprocessing, singfake generation, and ground-truth [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗

**Figure 3.** Figure 3: Musics Languages in SingFox Dataset as per ISO code. [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗

**Figure 4.** Figure 4: Model-wise distribution of generated singfakes in T5 track. [PITH_FULL_IMAGE:figures/full_fig_p006_4.png] view at source ↗

**Figure 5.** Figure 5: Language and music distribution in SingFox dataset. [PITH_FULL_IMAGE:figures/full_fig_p007_5.png] view at source ↗

**Figure 6.** Figure 6: Model distribution per dataset track [PITH_FULL_IMAGE:figures/full_fig_p008_6.png] view at source ↗

**Figure 7.** Figure 7: Train/Val/testing distribution across different tracks of dataset. [PITH_FULL_IMAGE:figures/full_fig_p009_7.png] view at source ↗

**Figure 9.** Figure 9: It can be seen that acoustic features, such as LFCC (with BiLSTM as backend classifier) outperforms SSL-based [PITH_FULL_IMAGE:figures/full_fig_p009_9.png] view at source ↗

**Figure 8.** Figure 8: Results on various acoustic features with ResNet classifier on various tracks of proposed singfox dataset. [PITH_FULL_IMAGE:figures/full_fig_p010_8.png] view at source ↗

**Figure 9.** Figure 9: DET curve (# singfake trials = 24,997, # genuine trials = 32,741) of T6 [PITH_FULL_IMAGE:figures/full_fig_p010_9.png] view at source ↗

**Figure 10.** Figure 10: Distribution on each language, and instrument in each track of dataset. [PITH_FULL_IMAGE:figures/full_fig_p016_10.png] view at source ↗

read the original abstract

In this work, we introduce SingFox, a comprehensive and large-scale dataset specifically designed to support robust evaluation of singing deepfake detection and source tracing systems. SingFox is divided into six distinct tracks (T1--T6), each targeting a unique form of novelty, ranging from language diversity (global and Indian) to genre-specific music and alternative fake generation methods. The dataset encompasses over 113,802 audio clips across 20 languages, totaling more than 126.32 hours of audio data and featuring 1,150 singers. Each track is designed to emulate real-world scenarios and evaluate how reliably models perform under different conditions, thereby assessing their robustness. SingFox aims to foster reproducibility and accelerate research in singing deepfake detection by providing a reliable benchmark for both the singfake detection task and the source verification task (model explainability). Experimental results show a highest accuracy of 77.84\% in cross-dataset evaluation settings. All code and resources required to reproduce the dataset are publicly available at https://github.com/Arth-Shah/SingFox.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

SingFox is a new multi-lingual singing deepfake dataset with six tracks and public code, but the paper supplies almost no evidence that the tracks actually test robustness.

read the letter

The main takeaway is that this is a dataset release paper. SingFox contains over 113k clips from 1,150 singers across 20 languages and splits the data into six tracks meant to cover language variety, genres, and different fake generation approaches.

What stands out is the scale and the decision to release the full construction code on GitHub. That makes the resource usable for anyone who wants to run their own experiments. The inclusion of a source verification task alongside basic detection is also a reasonable addition.

The soft spot is the evaluation section. The abstract states a highest accuracy of 77.84% in cross-dataset settings, yet provides no model details, split information, or error bars. More importantly, the central claim that the six tracks emulate real-world conditions is presented without any quantitative check or external validation. For a benchmark dataset, that missing step weakens the main selling point.

This paper is for groups already working on audio deepfake detection who need fresh multi-lingual singing data to test against. A reader focused on dataset construction or applied audio forensics will get direct value from the release itself.

It deserves peer review. Dataset papers can be worth referee time when the data is large and accessible, even if the current validation is thin. Revisions would likely ask for more concrete tests of the track design.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces SingFox, a large-scale multi-lingual dataset for singing deepfake detection and source tracing, comprising over 113,802 audio clips (126.32 hours) from 1,150 singers across 20 languages. It is organized into six tracks (T1–T6) targeting language diversity, genre-specific music, and alternative fake generation methods, each intended to emulate real-world conditions for assessing model robustness. The work positions the dataset as a reproducible benchmark for both detection and source verification tasks, reports a peak cross-dataset accuracy of 77.84%, and releases all code and resources publicly.

Significance. If the tracks validly represent real-world distributions and failure modes, the dataset release—with its scale, language coverage, and public reproducibility artifacts—would provide a valuable standardized benchmark that could accelerate progress in singing deepfake detection research.

major comments (2)

[Abstract] Abstract: The claim that the six tracks 'emulate real-world scenarios' and thereby 'assess their robustness' is load-bearing for the central contribution, yet the manuscript supplies no quantitative validation (e.g., distributional statistics, expert listening tests, or comparison against external real-world corpora) to support this emulation.
[Abstract] Experimental results paragraph: The reported 77.84% cross-dataset accuracy is presented without model architecture, training protocol, data splits, or confidence intervals, preventing assessment of whether the number actually demonstrates the benchmark's utility for robustness evaluation.

minor comments (1)

[Abstract] The abstract would benefit from a one-sentence summary of the models or baselines used to obtain the 77.84% figure.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript introducing SingFox. We address each major comment point-by-point below. We agree that the abstract requires strengthening for the claims made and will revise accordingly to improve clarity and support.

read point-by-point responses

Referee: [Abstract] Abstract: The claim that the six tracks 'emulate real-world scenarios' and thereby 'assess their robustness' is load-bearing for the central contribution, yet the manuscript supplies no quantitative validation (e.g., distributional statistics, expert listening tests, or comparison against external real-world corpora) to support this emulation.

Authors: We acknowledge that the abstract's phrasing regarding emulation of real-world scenarios would be strengthened by explicit quantitative support. The tracks are constructed to target documented real-world challenges (language diversity, genre shifts, and alternative generation methods), as described in the dataset construction sections. However, we agree that additional validation is warranted. In the revised manuscript, we will add distributional statistics comparing track characteristics to external real-world singing corpora and, where feasible, expert listening test results to substantiate the design choices. revision: yes
Referee: [Abstract] Experimental results paragraph: The reported 77.84% cross-dataset accuracy is presented without model architecture, training protocol, data splits, or confidence intervals, preventing assessment of whether the number actually demonstrates the benchmark's utility for robustness evaluation.

Authors: The abstract summarizes the peak result at a high level, while the full manuscript details the model architectures, training protocols, data splits, and evaluation methodology in the Experiments section. To address the concern about self-containment, we will revise the abstract to include brief references to these elements and report confidence intervals for the accuracy figures in the updated version. revision: yes

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper is a dataset release paper whose central contribution is the introduction of SingFox with six tracks, data statistics, and reported cross-dataset accuracy. No equations, derivations, fitted parameters, or mathematical claims appear in the provided abstract or description. The experimental result (77.84% accuracy) is presented as an empirical observation rather than a derived prediction from internal inputs. No self-citation chains, ansatzes, or uniqueness theorems are invoked as load-bearing steps. The derivation chain is empty by nature of the contribution type, making internal circularity impossible.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is a dataset release paper with no mathematical derivations, fitted parameters, axioms, or postulated entities. The contribution rests on the curation and release of audio data.

pith-pipeline@v0.9.1-grok · 5729 in / 1049 out tokens · 16919 ms · 2026-06-26T19:27:48.254953+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

52 extracted references · 1 linked inside Pith

[1]

Benesty, J

J. Benesty, J. Chen, Y . Huang, and I. Cohen. Pearson Correlation Coefficient. InNoise Reduction in Speech Processing, pages 1–4. Springer, 2009

2009
[2]

J. D. Byrum. Iso 639-1 and iso 639-2: International standards for language codes. iso 15924: International standard for names of scripts. InIFLA Council and General Conference. ERIC, 1999, Bangkok, Thailand

1999
[3]

Casini, L

L. Casini, L. C. Vila, D. Dalmazzo, A.-K. Kaila, and B. L. Sturm. Data-driven analysis of text-conditioned AI-generated music: A case study with suno and udio.arXiv preprint arXiv:2509.11824, 2025, {Last Accessed: 27thF ebruary,2026}

arXiv 2025
[4]

X. Chen, H. Wu, R. Jang, and H.-y. Lee. Singing voice graph modeling for singfake detection. InINTERSPEECH, pages 4843–4847, 2024, Kos Island, Greece

2024
[5]

Chhibber, J

M. Chhibber, J. Mishra, and T. H. Kinnunen. Advancing zero-shot open-set speech deepfake source tracing.arXiv preprint arXiv:2509.24674, 2025, {Last Accessed:27 thF ebruary,2026}

Pith/arXiv arXiv 2025
[6]

K. Cho, B. Van Merriënboer, Ç. Gulçehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y . Bengio. Learning phrase representations using rnn encoder–decoder for statistical machine translation. InConference on Empirical Methods in Natural Language Processing (EMNLP), pages 1724–1734, 2014, Doha, Qatar

2014
[7]

Comanducci, P

L. Comanducci, P. Bestagini, and S. Tubaro. Fakemusiccaps: A dataset for detection and attribution of synthetic music generated via text-to-music models.Journal of Imaging, 11(7):242, 2025

2025
[8]

Davis and P

S. Davis and P. Mermelstein. Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences.IEEE Transactions on Acoustics, Speech, and Signal Processing, 28(4):357–366, 1980. 12 APREPRINT

1980
[9]

A. Firc, M. Chhibber, J. Mishra, V . Pratap Singh, T. Kinnunen, and K. Malinka. STOPA: A dataset of systematic variation of deepfake audio for open-set source tracing and attribution. InINTERSPEECH, pages 1553–1557, 2025, Rotterdam, Netherlands

2025
[10]

J. Han, E. Yang, and U. Oh. Understanding the use of AI-based audio generation models by end-users. InExtended Abstracts of the CHI Conference on Human Factors in Computing Systems, volume 355, pages 1–7. ACM, 2024, Hamburg, Germany

2024
[11]

Hermansky and N

H. Hermansky and N. Morgan. RASTA processing of speech.IEEE Transactions on Speech and Audio Processing, 2(4):578–589, 2002

2002
[12]

Y . Hong, J. Feng, H. Chen, J. Lan, H. Zhu, W. Wang, and J. Zhang. Wildfake: A large-scale and hierarchical dataset for AI-generated images detection. InProceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 3500–3508, 2025, Philadelphia, Pennsylvania, USA

2025
[13]

W.-N. Hsu, B. Bolte, Y .-H. H. Tsai, K. Lakhotia, R. Salakhutdinov, and A. Mohamed. HuBERT: Self-supervised speech representation learning by masked prediction of hidden units.IEEE/ACM Transactions on Audio, Speech, and Language Processing, 29:3451–3460, 2021

2021
[14]

Huang, J

Z. Huang, J. Hu, X. Li, Y . He, X. Zhao, B. Peng, B. Wu, X. Huang, and G. Cheng. SIDA: Social media image deepfake detection, localization, and explanation with large multimodal model. InComputer Vision and Pattern Recognition Conference (CVPR), pages 28831–28841, 2025, Nashville, Tennessee, USA

2025
[15]

Ito and L

K. Ito and L. Johnson. The LJ Speech dataset. https://keithito.com/LJ-Speech-Dataset/, 2017. {Last Accessed:3 rd March, 2026}

2017
[16]

W. Jang, D. Lim, J. Yoon, B. Kim, and J. Kim. UnivNet: A neural vocoder with multi-resolution spectrogram discriminators for high-fidelity waveform generation. InINTERSPEECH, pages 2207–2211, 2021, Brno, Czechia

2021
[17]

J.-w. Jung, Y . Wu, X. Wang, J.-H. Kim, S. Maiti, Y . Matsunaga, H.-j. Shim, J. Tian, N. Evans, J. S. Chung, et al. SpoofCeleb: Speech deepfake detection and SASV in the wild.IEEE Open Journal of Signal Processing, 6:68–77, 2025

2025
[18]

Klein, T

N. Klein, T. Chen, H. Tak, R. Casal, and E. Khoury. Source tracing of audio deepfake systems. InINTERSPEECH, pages 1100–1104, 2024, Kos Island, Greece

2024
[19]

J. Kong, J. Kim, and J. Bae. HiFi-GAN: Generative adversarial networks for efficient and high-fidelity speech synthesis.Advances in Neural Information Processing Systems (NIPS), Virtual, 33:17022–17033, 2020

2020
[20]

Kubichek

R. Kubichek. Mel cepstral distance measure for objective speech quality assessment. InIEEE Pacific Rim Conference on Communications Computers and Signal Processing (PACRIM), volume 1, pages 125–128, 1993, Victoria, BC, Canada

1993
[21]

LeCun, Y

Y . LeCun, Y . Bengio, and G. Hinton. Deep learning.Nature, 521(7553):436–444, 2015

2015
[22]

S. G. Lee, W. Ping, B. Ginsburg, B. Catanzaro, and S. Yoon. BigVGAN: A universal neural vocoder with large-scale training. In 11th International Conference on Learning Representations (ICLR), Kigali, Rwanda, 2023

2023
[23]

M. Li, Y . Ahmadiadli, and X.-P. Zhang. A survey on speech deepfake detection.ACM Computing Surveys, 57(7):1–38, 2025

2025
[24]

J. Liu, C. Li, Y . Ren, F. Chen, and Z. Zhao. Diffsinger: Singing voice synthesis via shallow diffusion mechanism. InAAAI conference on Artificial Intelligence, volume 36, pages 11020–11028, 2022, (Virtual) USA

2022
[25]

X. Liu, X. Wang, M. Sahidullah, J. Patino, H. Delgado, T. Kinnunen, M. Todisco, J. Yamagishi, N. Evans, A. Nautsch, et al. ASVSpoof 2021: Towards spoofed and deepfake speech detection in the wild.IEEE/ACM Transactions on Audio, Speech, and Language Processing, 31:2507–2522, 2023

2021
[26]

K. T. Mai, S. Bray, T. Davies, and L. D. Griffin. Warning: Humans cannot reliably detect speech deepfakes.PLoS One, 18(8):285–333, 2023

2023
[27]

Müller, P

N. Müller, P. Czempin, F. Diekmann, A. Froghyar, and K. Böttinger. Does audio deepfake detection generalize? InINTERSPEECH, pages 2783–2787, 2022, Incheon, Korea

2022
[28]

N. M. Müller et al. MLAAD: The multi-language audio anti-spoofing dataset. InInternational Joint Conference on Neural Networks (IJCNN), Yokohama, Japan, pages 1–7, 2024

2024
[29]

Negroni, D

V . Negroni, D. Salvi, P. Bestagini, S. Tubaro, et al. Source verification for speech deepfakes. InINTERSPEECH, pages 1–5. 2025, Rotterdam, Netherlands. 13 APREPRINT

2025
[30]

Z. Ning, H. Chen, Y . Jiang, C. Hao, G. Ma, S. Wang, J. Yao, and L. Xie. Diffrhythm: Blazingly fast and embar- rassingly simple end-to-end full-length song generation with latent diffusion.arXiv preprint arXiv:2503.01183, 2025 {Last Accessed:17 th Feb, 2026}

arXiv 2025
[31]

Panayotov, G

V . Panayotov, G. Chen, D. Povey, and S. Khudanpur. Librispeech: an asr corpus based on public domain audio books. InIEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5206–5210, 2015, South Brisbane, Queensland, Australia

2015
[32]

Radford, J

A. Radford, J. W. Kim, T. Xu, G. Brockman, C. McLeavey, and I. Sutskever. Robust speech recognition via large-scale weak supervision. InInternational Conference on Machine Learning (ICML), pages 28492–28518, 2023, Honolulu, HI, USA

2023
[33]

M. A. Rahman, Z. I. A. Hakim, N. H. Sarker, B. Paul, and S. A. Fattah. SONICS: Synthetic or not - identifying counterfeit songs. InThe13 th International Conference on Learning Representations, (ICLR), Singapore, 2025

2025
[34]

A. W. Rix, J. G. Beerends, M. P. Hollier, and A. P. Hekstra. Perceptual evaluation of speech quality (PESQ)-a new method for speech quality assessment of telephone networks and codecs. InIEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), volume 2, pages 749–752, 2001, Salt Lake City, Utah, USA

2001
[35]

GitHub: https://github.com/RVC-Project/Retrieval-based-V oice-Conversion-WebUI.{Last Accessed: 27th, F ebruary,2025}, 2024

RVC-Project. GitHub: https://github.com/RVC-Project/Retrieval-based-V oice-Conversion-WebUI.{Last Accessed: 27th, F ebruary,2025}, 2024

2025
[36]

Siami-Namini, N

S. Siami-Namini, N. Tavakoli, and A. S. Namin. The performance of lstm and bilstm in forecasting time series. In IEEE International Conference on Big Data (Big Data), pages 3285–3292, 2019, Los Angeles, CA, USA

2019
[37]

R. C. Streijl, S. Winkler, and D. S. Hands. Mean Opinion Score (MOS) revisited: Methods and applications, limitations and alternatives.Multimedia Systems, 22(2):213–227, 2016

2016
[38]

C. H. Taal, R. C. Hendriks, R. Heusdens, and J. Jensen. An algorithm for intelligibility prediction of time–frequency weighted noisy speech.IEEE Transactions on Audio, Speech, and Language Processing, 19(7):2125–2136, 2011

2011
[39]

Todisco, X

M. Todisco, X. Wang, V . Vestman, M. Sahidullah, H. Delgado, A. Nautsch, J. Yamagishi, N. Evans, T. Kinnunen, and K. A. Lee. ASVSpoof 2019: Future horizons in spoofed and fake audio detection. InINTERSPEECH, pages 1008–1012, 2019, Graz, Austria

2019
[40]

X. Wang, H. Delgado, H. Tak, J. weon Jung, H. jin Shim, M. Todisco, I. Kukanov, X. Liu, M. Sahidullah, T. Kinnunen, N. Evans, K. A. Lee, J. Yamagishi, M. Jeong, G. Zhu, Y . Zang, Y . Zhang, S. Maiti, F. Lux, N. Müller, W. Zhang, C. Sun, S. Hou, S. Lyu, S. Le Maguer, C. Gong, H. Guo, L. Chen, and V . Singh. ASVspoof 5: Design, collection and validation of ...

2026
[41]

Z. Wu, J. Yamagishi, T. Kinnunen, C. Hanilçi, M. Sahidullah, A. Sizov, N. Evans, M. Todisco, and H. Delgado. ASVSpoof: The automatic speaker verification spoofing and countermeasures challenge.IEEE Journal of Selected Topics in Signal Processing, 11(4):588–604, 2017

2017
[42]

Y . Xie, J. Zhou, X. Lu, Z. Jiang, Y . Yang, H. Cheng, and L. Ye. FSD: An initial Chinese dataset for fake song detection. InIEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4605–4609, 2024, Seoul, Korea

2024
[43]

X. Xuan, Y . Xiao, R. K. Das, and T. Kinnunen. Multilingual source tracing of speech deepfakes: A first benchmark. In 5th Symposium on Security and Privacy in Speech Communicatio (SPSC), pages 27–34, 2025, Delft, Netherlands

2025
[44]

Yamagishi, C

J. Yamagishi, C. Veaux, and K. MacDonald. CSTR VCTK Corpus: English multi-speaker corpus for cstr voice cloning toolkit (version 0.92).The Rainbow Passage which the speakers read out can be found in the International Dialects of English Archive:(http://web. ku. edu/˜ idea/readings/rainbow. htm)., 2019

2019
[45]

Yamagishi, X

J. Yamagishi, X. Wang, M. Todisco, M. Sahidullah, J. Patino, A. Nautsch, X. Liu, K. A. Lee, T. Kinnunen, N. Evans, et al. ASVSpoof 2021: Accelerating progress in spoofed and deepfake speech detection. InASVSpoof Workshop, pages 47–54, 2021, Kos Island, Greece

2021
[46]

Z. Yan, T. Yao, S. Chen, Y . Zhao, X. Fu, J. Zhu, D. Luo, C. Wang, S. Ding, Y . Wu, et al. Df40: Toward next- generation deepfake detection.Advances in Neural Information Processing Systems (NeurIPS), 37:29387–29434, 2024, Vancouver, Canada

2024
[47]

Y . Zang, J. Shi, Y . Zhang, R. Yamamoto, J. Han, Y . Tang, S. Xu, W. Zhao, J. Guo, T. Toda, and Z. Duan. CtrSVDD: A benchmark dataset and baseline analysis for controlled singing voice deepfake detection. InINTERSPEECH 2024, Kos, Greece, pages 4783–4787

2024
[48]

Y . Zang, Y . Zhang, M. Heydari, and Z. Duan. Singfake: Singing voice deepfake detection. InIEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 12156–12160, 2024, Seoul, Korea. 14 APREPRINT

2024
[49]

Zhang, J

Y . Zhang, J. Cong, H. Xue, L. Xie, P. Zhu, and M. Bi. Visinger: Variational inference with adversarial learning for end-to-end singing voice synthesis. InIEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 7237–7241, 2022, (Virtual) Singapore

2022
[50]

Zhang, Y

Y . Zhang, Y . Zang, J. Shi, R. Yamamoto, T. Toda, and Z. Duan. SVDD 2024: The inaugural singing voice deepfake detection challenge. InIEEE Spoken Language Technology Workshop (SLT), pages 782–787, 2024, Macao, China

2024
[51]

Zhao and D

X. Zhao and D. Wang. Analyzing noise robustness of MFCC and GFCC features in speaker identification. In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pages 7204–7208, 2013, Vancover, Canada

2013
[52]

T. Zhu, X. Wang, X. Qin, and M. Li. Source tracing: Detecting voice spoofing. InAsia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), pages 216–220, 2022, Chiang Mai, Thailand. A Performance Metrices A.1 Objective Evaluation Metrics 1.Perceptual Evaluation of Speech Quality (PESQ) [34]: This metric estimates...

2022

[1] [1]

Benesty, J

J. Benesty, J. Chen, Y . Huang, and I. Cohen. Pearson Correlation Coefficient. InNoise Reduction in Speech Processing, pages 1–4. Springer, 2009

2009

[2] [2]

J. D. Byrum. Iso 639-1 and iso 639-2: International standards for language codes. iso 15924: International standard for names of scripts. InIFLA Council and General Conference. ERIC, 1999, Bangkok, Thailand

1999

[3] [3]

Casini, L

L. Casini, L. C. Vila, D. Dalmazzo, A.-K. Kaila, and B. L. Sturm. Data-driven analysis of text-conditioned AI-generated music: A case study with suno and udio.arXiv preprint arXiv:2509.11824, 2025, {Last Accessed: 27thF ebruary,2026}

arXiv 2025

[4] [4]

X. Chen, H. Wu, R. Jang, and H.-y. Lee. Singing voice graph modeling for singfake detection. InINTERSPEECH, pages 4843–4847, 2024, Kos Island, Greece

2024

[5] [5]

Chhibber, J

M. Chhibber, J. Mishra, and T. H. Kinnunen. Advancing zero-shot open-set speech deepfake source tracing.arXiv preprint arXiv:2509.24674, 2025, {Last Accessed:27 thF ebruary,2026}

Pith/arXiv arXiv 2025

[6] [6]

K. Cho, B. Van Merriënboer, Ç. Gulçehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y . Bengio. Learning phrase representations using rnn encoder–decoder for statistical machine translation. InConference on Empirical Methods in Natural Language Processing (EMNLP), pages 1724–1734, 2014, Doha, Qatar

2014

[7] [7]

Comanducci, P

L. Comanducci, P. Bestagini, and S. Tubaro. Fakemusiccaps: A dataset for detection and attribution of synthetic music generated via text-to-music models.Journal of Imaging, 11(7):242, 2025

2025

[8] [8]

Davis and P

S. Davis and P. Mermelstein. Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences.IEEE Transactions on Acoustics, Speech, and Signal Processing, 28(4):357–366, 1980. 12 APREPRINT

1980

[9] [9]

A. Firc, M. Chhibber, J. Mishra, V . Pratap Singh, T. Kinnunen, and K. Malinka. STOPA: A dataset of systematic variation of deepfake audio for open-set source tracing and attribution. InINTERSPEECH, pages 1553–1557, 2025, Rotterdam, Netherlands

2025

[10] [10]

J. Han, E. Yang, and U. Oh. Understanding the use of AI-based audio generation models by end-users. InExtended Abstracts of the CHI Conference on Human Factors in Computing Systems, volume 355, pages 1–7. ACM, 2024, Hamburg, Germany

2024

[11] [11]

Hermansky and N

H. Hermansky and N. Morgan. RASTA processing of speech.IEEE Transactions on Speech and Audio Processing, 2(4):578–589, 2002

2002

[12] [12]

Y . Hong, J. Feng, H. Chen, J. Lan, H. Zhu, W. Wang, and J. Zhang. Wildfake: A large-scale and hierarchical dataset for AI-generated images detection. InProceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 3500–3508, 2025, Philadelphia, Pennsylvania, USA

2025

[13] [13]

W.-N. Hsu, B. Bolte, Y .-H. H. Tsai, K. Lakhotia, R. Salakhutdinov, and A. Mohamed. HuBERT: Self-supervised speech representation learning by masked prediction of hidden units.IEEE/ACM Transactions on Audio, Speech, and Language Processing, 29:3451–3460, 2021

2021

[14] [14]

Huang, J

Z. Huang, J. Hu, X. Li, Y . He, X. Zhao, B. Peng, B. Wu, X. Huang, and G. Cheng. SIDA: Social media image deepfake detection, localization, and explanation with large multimodal model. InComputer Vision and Pattern Recognition Conference (CVPR), pages 28831–28841, 2025, Nashville, Tennessee, USA

2025

[15] [15]

Ito and L

K. Ito and L. Johnson. The LJ Speech dataset. https://keithito.com/LJ-Speech-Dataset/, 2017. {Last Accessed:3 rd March, 2026}

2017

[16] [16]

W. Jang, D. Lim, J. Yoon, B. Kim, and J. Kim. UnivNet: A neural vocoder with multi-resolution spectrogram discriminators for high-fidelity waveform generation. InINTERSPEECH, pages 2207–2211, 2021, Brno, Czechia

2021

[17] [17]

J.-w. Jung, Y . Wu, X. Wang, J.-H. Kim, S. Maiti, Y . Matsunaga, H.-j. Shim, J. Tian, N. Evans, J. S. Chung, et al. SpoofCeleb: Speech deepfake detection and SASV in the wild.IEEE Open Journal of Signal Processing, 6:68–77, 2025

2025

[18] [18]

Klein, T

N. Klein, T. Chen, H. Tak, R. Casal, and E. Khoury. Source tracing of audio deepfake systems. InINTERSPEECH, pages 1100–1104, 2024, Kos Island, Greece

2024

[19] [19]

J. Kong, J. Kim, and J. Bae. HiFi-GAN: Generative adversarial networks for efficient and high-fidelity speech synthesis.Advances in Neural Information Processing Systems (NIPS), Virtual, 33:17022–17033, 2020

2020

[20] [20]

Kubichek

R. Kubichek. Mel cepstral distance measure for objective speech quality assessment. InIEEE Pacific Rim Conference on Communications Computers and Signal Processing (PACRIM), volume 1, pages 125–128, 1993, Victoria, BC, Canada

1993

[21] [21]

LeCun, Y

Y . LeCun, Y . Bengio, and G. Hinton. Deep learning.Nature, 521(7553):436–444, 2015

2015

[22] [22]

S. G. Lee, W. Ping, B. Ginsburg, B. Catanzaro, and S. Yoon. BigVGAN: A universal neural vocoder with large-scale training. In 11th International Conference on Learning Representations (ICLR), Kigali, Rwanda, 2023

2023

[23] [23]

M. Li, Y . Ahmadiadli, and X.-P. Zhang. A survey on speech deepfake detection.ACM Computing Surveys, 57(7):1–38, 2025

2025

[24] [24]

J. Liu, C. Li, Y . Ren, F. Chen, and Z. Zhao. Diffsinger: Singing voice synthesis via shallow diffusion mechanism. InAAAI conference on Artificial Intelligence, volume 36, pages 11020–11028, 2022, (Virtual) USA

2022

[25] [25]

X. Liu, X. Wang, M. Sahidullah, J. Patino, H. Delgado, T. Kinnunen, M. Todisco, J. Yamagishi, N. Evans, A. Nautsch, et al. ASVSpoof 2021: Towards spoofed and deepfake speech detection in the wild.IEEE/ACM Transactions on Audio, Speech, and Language Processing, 31:2507–2522, 2023

2021

[26] [26]

K. T. Mai, S. Bray, T. Davies, and L. D. Griffin. Warning: Humans cannot reliably detect speech deepfakes.PLoS One, 18(8):285–333, 2023

2023

[27] [27]

Müller, P

N. Müller, P. Czempin, F. Diekmann, A. Froghyar, and K. Böttinger. Does audio deepfake detection generalize? InINTERSPEECH, pages 2783–2787, 2022, Incheon, Korea

2022

[28] [28]

N. M. Müller et al. MLAAD: The multi-language audio anti-spoofing dataset. InInternational Joint Conference on Neural Networks (IJCNN), Yokohama, Japan, pages 1–7, 2024

2024

[29] [29]

Negroni, D

V . Negroni, D. Salvi, P. Bestagini, S. Tubaro, et al. Source verification for speech deepfakes. InINTERSPEECH, pages 1–5. 2025, Rotterdam, Netherlands. 13 APREPRINT

2025

[30] [30]

Z. Ning, H. Chen, Y . Jiang, C. Hao, G. Ma, S. Wang, J. Yao, and L. Xie. Diffrhythm: Blazingly fast and embar- rassingly simple end-to-end full-length song generation with latent diffusion.arXiv preprint arXiv:2503.01183, 2025 {Last Accessed:17 th Feb, 2026}

arXiv 2025

[31] [31]

Panayotov, G

V . Panayotov, G. Chen, D. Povey, and S. Khudanpur. Librispeech: an asr corpus based on public domain audio books. InIEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5206–5210, 2015, South Brisbane, Queensland, Australia

2015

[32] [32]

Radford, J

A. Radford, J. W. Kim, T. Xu, G. Brockman, C. McLeavey, and I. Sutskever. Robust speech recognition via large-scale weak supervision. InInternational Conference on Machine Learning (ICML), pages 28492–28518, 2023, Honolulu, HI, USA

2023

[33] [33]

M. A. Rahman, Z. I. A. Hakim, N. H. Sarker, B. Paul, and S. A. Fattah. SONICS: Synthetic or not - identifying counterfeit songs. InThe13 th International Conference on Learning Representations, (ICLR), Singapore, 2025

2025

[34] [34]

A. W. Rix, J. G. Beerends, M. P. Hollier, and A. P. Hekstra. Perceptual evaluation of speech quality (PESQ)-a new method for speech quality assessment of telephone networks and codecs. InIEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), volume 2, pages 749–752, 2001, Salt Lake City, Utah, USA

2001

[35] [35]

GitHub: https://github.com/RVC-Project/Retrieval-based-V oice-Conversion-WebUI.{Last Accessed: 27th, F ebruary,2025}, 2024

RVC-Project. GitHub: https://github.com/RVC-Project/Retrieval-based-V oice-Conversion-WebUI.{Last Accessed: 27th, F ebruary,2025}, 2024

2025

[36] [36]

Siami-Namini, N

S. Siami-Namini, N. Tavakoli, and A. S. Namin. The performance of lstm and bilstm in forecasting time series. In IEEE International Conference on Big Data (Big Data), pages 3285–3292, 2019, Los Angeles, CA, USA

2019

[37] [37]

R. C. Streijl, S. Winkler, and D. S. Hands. Mean Opinion Score (MOS) revisited: Methods and applications, limitations and alternatives.Multimedia Systems, 22(2):213–227, 2016

2016

[38] [38]

C. H. Taal, R. C. Hendriks, R. Heusdens, and J. Jensen. An algorithm for intelligibility prediction of time–frequency weighted noisy speech.IEEE Transactions on Audio, Speech, and Language Processing, 19(7):2125–2136, 2011

2011

[39] [39]

Todisco, X

M. Todisco, X. Wang, V . Vestman, M. Sahidullah, H. Delgado, A. Nautsch, J. Yamagishi, N. Evans, T. Kinnunen, and K. A. Lee. ASVSpoof 2019: Future horizons in spoofed and fake audio detection. InINTERSPEECH, pages 1008–1012, 2019, Graz, Austria

2019

[40] [40]

X. Wang, H. Delgado, H. Tak, J. weon Jung, H. jin Shim, M. Todisco, I. Kukanov, X. Liu, M. Sahidullah, T. Kinnunen, N. Evans, K. A. Lee, J. Yamagishi, M. Jeong, G. Zhu, Y . Zang, Y . Zhang, S. Maiti, F. Lux, N. Müller, W. Zhang, C. Sun, S. Hou, S. Lyu, S. Le Maguer, C. Gong, H. Guo, L. Chen, and V . Singh. ASVspoof 5: Design, collection and validation of ...

2026

[41] [41]

Z. Wu, J. Yamagishi, T. Kinnunen, C. Hanilçi, M. Sahidullah, A. Sizov, N. Evans, M. Todisco, and H. Delgado. ASVSpoof: The automatic speaker verification spoofing and countermeasures challenge.IEEE Journal of Selected Topics in Signal Processing, 11(4):588–604, 2017

2017

[42] [42]

Y . Xie, J. Zhou, X. Lu, Z. Jiang, Y . Yang, H. Cheng, and L. Ye. FSD: An initial Chinese dataset for fake song detection. InIEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4605–4609, 2024, Seoul, Korea

2024

[43] [43]

X. Xuan, Y . Xiao, R. K. Das, and T. Kinnunen. Multilingual source tracing of speech deepfakes: A first benchmark. In 5th Symposium on Security and Privacy in Speech Communicatio (SPSC), pages 27–34, 2025, Delft, Netherlands

2025

[44] [44]

Yamagishi, C

J. Yamagishi, C. Veaux, and K. MacDonald. CSTR VCTK Corpus: English multi-speaker corpus for cstr voice cloning toolkit (version 0.92).The Rainbow Passage which the speakers read out can be found in the International Dialects of English Archive:(http://web. ku. edu/˜ idea/readings/rainbow. htm)., 2019

2019

[45] [45]

Yamagishi, X

J. Yamagishi, X. Wang, M. Todisco, M. Sahidullah, J. Patino, A. Nautsch, X. Liu, K. A. Lee, T. Kinnunen, N. Evans, et al. ASVSpoof 2021: Accelerating progress in spoofed and deepfake speech detection. InASVSpoof Workshop, pages 47–54, 2021, Kos Island, Greece

2021

[46] [46]

Z. Yan, T. Yao, S. Chen, Y . Zhao, X. Fu, J. Zhu, D. Luo, C. Wang, S. Ding, Y . Wu, et al. Df40: Toward next- generation deepfake detection.Advances in Neural Information Processing Systems (NeurIPS), 37:29387–29434, 2024, Vancouver, Canada

2024

[47] [47]

Y . Zang, J. Shi, Y . Zhang, R. Yamamoto, J. Han, Y . Tang, S. Xu, W. Zhao, J. Guo, T. Toda, and Z. Duan. CtrSVDD: A benchmark dataset and baseline analysis for controlled singing voice deepfake detection. InINTERSPEECH 2024, Kos, Greece, pages 4783–4787

2024

[48] [48]

Y . Zang, Y . Zhang, M. Heydari, and Z. Duan. Singfake: Singing voice deepfake detection. InIEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 12156–12160, 2024, Seoul, Korea. 14 APREPRINT

2024

[49] [49]

Zhang, J

Y . Zhang, J. Cong, H. Xue, L. Xie, P. Zhu, and M. Bi. Visinger: Variational inference with adversarial learning for end-to-end singing voice synthesis. InIEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 7237–7241, 2022, (Virtual) Singapore

2022

[50] [50]

Zhang, Y

Y . Zhang, Y . Zang, J. Shi, R. Yamamoto, T. Toda, and Z. Duan. SVDD 2024: The inaugural singing voice deepfake detection challenge. InIEEE Spoken Language Technology Workshop (SLT), pages 782–787, 2024, Macao, China

2024

[51] [51]

Zhao and D

X. Zhao and D. Wang. Analyzing noise robustness of MFCC and GFCC features in speaker identification. In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pages 7204–7208, 2013, Vancover, Canada

2013

[52] [52]

T. Zhu, X. Wang, X. Qin, and M. Li. Source tracing: Detecting voice spoofing. InAsia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), pages 216–220, 2022, Chiang Mai, Thailand. A Performance Metrices A.1 Objective Evaluation Metrics 1.Perceptual Evaluation of Speech Quality (PESQ) [34]: This metric estimates...

2022