Escaping the Linearity Trap: Manifold Detours for Black-Box Adversarial Attacks on Singing Audio Deepfake Detection

Jiaheng Wei; Xinhu Zheng; Xinlei He; Yifan Liao; Yule Liu; Yupeng He; Zhen Sun; Zongmin Zhang

arXiv: 2605.30366 · v1 · pith:HIVGUAWKnew · submitted 2026-05-18 · 💻 cs.CR · cs.SD · eess.AS

Yifan Liao, Yule Liu, Zhen Sun, Zongmin Zhang, Yupeng He, Jiaheng Wei, Xinhu Zheng, Xinlei He

Pith reviewed 2026-06-30 18:56 UTC · model grok-4.3

classification 💻 cs.CR · cs.SD · eess.AS

keywords singing voice deepfake detection · adversarial attacks · black-box attacks · self-supervised learning · transferability · linearity trap · MARS · SVDD

The pith

MARS uses bi-level optimization to guide adversarial attacks on singing deepfake detectors toward the natural semantic manifold, escaping the linearity trap and improving transfer success rates.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that standard adversarial attacks on SSL-based singing voice deepfake detectors get stuck in the linearity trap by following surrogate-specific gradients that target fine-tuned artifacts rather than shared spoof evidence. To address this, it introduces MARS, which constructs anchors in pre-trained and fine-tuned SSL spaces and applies bi-level optimization: inner tangential exploration combined with outer guidance to the natural semantic manifold. This shift allows attacks to suppress common spoof cues more effectively, leading to better black-box transfer. If correct, this means current detectors are less robust than thought, as targeted attacks can be made more transferable. The method is evaluated on the CtrSVDD benchmark showing gains in multiple transfer scenarios.

Core claim

Existing attacks fail because they optimize cross-entropy on local surrogates and follow dominant gradient directions aligned with fine-tuned artifact-sensitive features, a geometric issue called the Linearity Trap. MARS escapes this by shifting to hypothesis-evidence manipulation using a natural semantic anchor from pre-trained SSL space and an artifact anchor from fine-tuned space, with bi-level optimization where the inner stage induces tangential exploration and the outer stage guides toward the natural semantic manifold.

What carries the argument

MARS (Meta-Adversarial Regression of Semantics), a bi-level optimization framework that performs inner tangential exploration and outer guidance to the natural semantic manifold using anchors from pre-trained and fine-tuned SSL spaces.
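
As a rough geometric illustration of the two stages named above, the sketch below works on plain vectors in a hypothetical embedding space. It is not the paper's algorithm, which this page does not specify: `tangential_component`, `detour_step`, the step sizes, and the choice of removing the projection onto the surrogate gradient are all assumptions made for illustration.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    return math.sqrt(dot(a, a))

def tangential_component(v, g):
    """Remove from v its projection onto g, leaving the part orthogonal to g."""
    gg = dot(g, g)
    if gg == 0.0:
        return list(v)
    scale = dot(v, g) / gg
    return [x - scale * y for x, y in zip(v, g)]

def detour_step(z, surrogate_grad, natural_anchor, inner_lr=0.1, outer_lr=0.1):
    """One toy bi-level step in embedding space (illustrative assumption).

    Inner stage: move along the part of the pull direction that is orthogonal
    to the surrogate's dominant gradient (a stand-in for tangential exploration).
    Outer stage: move the explored point toward the natural semantic anchor.
    """
    pull = [a - x for a, x in zip(natural_anchor, z)]
    tangent = tangential_component(pull, surrogate_grad)
    z_inner = [x + inner_lr * t for x, t in zip(z, tangent)]
    z_outer = [x + outer_lr * (a - x) for x, a in zip(z_inner, natural_anchor)]
    return z_outer

z = [1.0, 0.0, 0.0]
g = [1.0, 0.0, 0.0]          # hypothetical dominant surrogate gradient direction
anchor = [0.0, 1.0, 0.0]     # hypothetical natural semantic anchor
z_next = detour_step(z, g, anchor)
```

In this toy setup the inner move is orthogonal to the surrogate gradient by construction, and the outer move shortens the distance to the anchor; whether the real method behaves this way is exactly what the editorial analysis further down asks the authors to demonstrate.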

If this is right

MARS raises Attack Success Rate (ASR) in in-distribution transfer by 13% on CtrSVDD.
ASR in out-of-distribution transfer improves by 10%.
ASR in cross-task evaluation improves by 36%.
The approach highlights that SSL-SVDD systems require more robust defenses against manifold-aware attacks.
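
The first three items in the list above are ASR gains. As a reminder of what the metric measures, the sketch below assumes the common definition: the fraction of adversarial spoof samples that a target detector labels bona fide. The label strings, function name, and example predictions are hypothetical. The page does not say whether the reported gains are absolute percentage points or relative improvements, so the example computes the absolute difference only.

```python
def attack_success_rate(predictions, bona_fide_label="bonafide"):
    """Fraction of adversarial spoof inputs that the detector accepts as bona fide."""
    if not predictions:
        raise ValueError("need at least one prediction")
    hits = sum(1 for p in predictions if p == bona_fide_label)
    return hits / len(predictions)

# Hypothetical detector outputs on the same 8 adversarial spoof clips.
baseline_preds = ["spoof", "spoof", "bonafide", "spoof",
                  "spoof", "bonafide", "spoof", "spoof"]
attack_preds = ["bonafide", "spoof", "bonafide", "spoof",
                "bonafide", "bonafide", "spoof", "spoof"]

baseline_asr = attack_success_rate(baseline_preds)   # 2 of 8 clips accepted
attack_asr = attack_success_rate(attack_preds)       # 4 of 8 clips accepted
gain_points = 100.0 * (attack_asr - baseline_asr)    # absolute gain in percentage points
```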

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar manifold-detour techniques might improve attack transferability in other audio or image deepfake detection tasks.
If the linearity trap is general, then many fine-tuned SSL models in security applications could be vulnerable to attacks that target pre-trained representations.
Detectors could be hardened by incorporating manifold regularization during training to resist such detours.

Load-bearing premise

The bi-level optimization suppresses shared spoof evidence by guiding the audio toward the natural semantic manifold, without being trapped by surrogate-specific directions.

What would settle it

Running standard gradient-based attacks versus MARS on the CtrSVDD benchmark and finding no improvement or lower ASR in transfer settings would falsify the claim.
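
One way to make "no improvement" concrete is a paired bootstrap over per-clip outcomes, sketched below. This is a generic statistical check, not a procedure taken from the paper; the function name, the 0/1 success encoding, the example data, and the resample count are assumptions. If the resulting interval for the ASR difference includes zero or lies below it, the transfer gain is not supported on that sample.

```python
import random

def paired_bootstrap_asr_diff(baseline_success, mars_success,
                              n_resamples=2000, seed=0, alpha=0.05):
    """Percentile bootstrap interval for ASR(MARS) - ASR(baseline) on paired 0/1 outcomes."""
    if len(baseline_success) != len(mars_success) or not baseline_success:
        raise ValueError("need two non-empty lists of equal length")
    rng = random.Random(seed)
    n = len(baseline_success)
    diffs = []
    for _ in range(n_resamples):
        # Resample clips with replacement, keeping each clip's two outcomes paired.
        idx = [rng.randrange(n) for _ in range(n)]
        b = sum(baseline_success[i] for i in idx) / n
        m = sum(mars_success[i] for i in idx) / n
        diffs.append(m - b)
    diffs.sort()
    lo = diffs[int((alpha / 2) * n_resamples)]
    hi = diffs[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lo, hi

# Hypothetical per-clip outcomes (1 = attack fooled the target detector).
baseline = [0, 0, 1, 0, 0, 1, 0, 0, 0, 1]
mars = [1, 0, 1, 1, 0, 1, 1, 0, 1, 1]
low, high = paired_bootstrap_asr_diff(baseline, mars)
```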

Figures

Figures reproduced from arXiv: 2605.30366 by Jiaheng Wei, Xinhu Zheng, Xinlei He, Yifan Liao, Yule Liu, Yupeng He, Zhen Sun, Zongmin Zhang.

**Figure 2.** Overview of the proposed Meta-Adversarial Semantic…

**Figure 3.** Optimization Dynamics on S² under varying equilibrium factor γ, with higher γ representing higher repulsion.

We measure their directional interaction by

$$A^{pp}_t = \cos(d_{\text{pull}}, d_{\text{push}}) = \frac{d_{\text{pull}}^{\top} d_{\text{push}}}{\lVert d_{\text{pull}} \rVert_2 \, \lVert d_{\text{push}} \rVert_2}. \tag{13}$$

A strongly negative value indicates conflict between the two evidence directions, whereas a value closer to zero indicates weaker directional interference. We refer to the resulting instab…
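
The directional-interaction measure quoted in the Figure 3 excerpt is a plain cosine similarity between the pull and push directions. A minimal sketch follows; the function name and the example vectors are illustrative, and the zero-norm guard is an added assumption that the excerpt does not specify.

```python
import math

def directional_interaction(d_pull, d_push):
    """Cosine of the angle between the pull and push directions (Eq. 13 in the excerpt)."""
    dot = sum(p * q for p, q in zip(d_pull, d_push))
    norm_pull = math.sqrt(sum(p * p for p in d_pull))
    norm_push = math.sqrt(sum(q * q for q in d_push))
    if norm_pull == 0.0 or norm_push == 0.0:
        raise ValueError("directions must be non-zero")
    return dot / (norm_pull * norm_push)

conflict = directional_interaction([1.0, 0.0], [-2.0, 0.0])   # opposing directions
neutral = directional_interaction([1.0, 0.0], [0.0, 3.0])     # orthogonal directions
```

Opposing directions give a value at the negative end of the range (conflict), while orthogonal directions give zero (no directional interference), matching the reading given in the excerpt.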

**Figure 4.** Transferability of adversarial attacks on the OOD…

**Figure 5.** Ablation study on loss functions for the MultiConv…

**Figure 6.** Ablation study on γ and step for the MultiConv detector.

…perturbations. The improvement is especially significant when increasing ε from 0.001 to 0.005, where most models exhibit a sharp increase in ASR. However, after ε = 0.01, the performance gain becomes marginal: most ASR curves start to saturate, indicating that further enlarging the perturbation budget yields limited additional attack benefit. Meanwhi…
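
The ε values in the Figure 6 excerpt are perturbation budgets. Assuming ε is an L∞ bound on a waveform normalised to [-1, 1] (the excerpt does not say which norm is used), enforcing the budget amounts to the clipping step sketched below. `project_linf` is a hypothetical helper name and the sample values are made up.

```python
def project_linf(original, candidate, eps):
    """Clip candidate so each sample stays within eps of original and inside [-1, 1]."""
    if eps < 0:
        raise ValueError("eps must be non-negative")
    projected = []
    for x, x_adv in zip(original, candidate):
        delta = max(-eps, min(eps, x_adv - x))            # enforce the L-infinity budget
        projected.append(max(-1.0, min(1.0, x + delta)))  # keep a valid waveform sample
    return projected

clean = [0.0, 0.5, -0.999]
proposed = [0.02, 0.498, -1.2]
adv = project_linf(clean, proposed, eps=0.005)
```

Under this assumption, the budgets discussed in the excerpt (0.001 to 0.01) correspond to changing each sample by at most 0.1% to 1% of full scale.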

**Figure 8.** Ablation study on the ε Selection.

This structural heterogeneity creates orthogonal feature constraints, forcing the adversary to bypass the discriminative logic of both paradigms simultaneously.

• The Risk of Destructive Interference: Excessive divergence can be detrimental. The pairing of English-centric WavLMB with multilingual Wav2vec-X results in severe misalignment, leading to destructive gradient…

**Figure 9.** Robustness evaluation of our method across eight SSL…

**Figure 10.** Mel spectrogram comparison. (a) The original audio sample. (b) The generated adversarial sample.

**Figure 11.** Mechanism ablation of the proposed bi-level detour…

Original abstract

Recent Singing Voice Synthesis (SVS) advances enable highly realistic but potentially malicious AI covers, making singing voice deepfake detection (SVDD) crucial. Self-Supervised Learning (SSL)-based detectors achieve state-of-the-art performance by fine-tuning speech SSL backbones to capture singing-specific spoof artifacts. Existing adversarial attacks often fail against SSL-SVDD, creating a false impression of inherent robustness. We reveal this stems from two challenges. First, at the objective level, attacks optimize cross-entropy on local surrogates, crossing surrogate-specific boundaries rather than suppressing shared spoof evidence. Second, at the method level, attacks follow the surrogate's dominant gradient direction. In SSL-SVDD, this aligns with fine-tuned artifact-sensitive directions, limiting transferability to unseen detectors - a geometric failure we term the Linearity Trap. To properly evaluate robustness, we propose MARS (Meta-Adversarial Regression of Semantics), a transfer-based black-box framework tailored to SSL-SVDD. Structurally, MARS shifts to hypothesis-evidence manipulation by constructing a natural semantic anchor from the pre-trained SSL space and an artifact anchor from the fine-tuned space. Algorithmically, MARS escapes the Linearity Trap via bi-level optimization: the inner stage induces tangential exploration, while the outer stage guides the audio toward the natural semantic manifold. Experiments on the CtrSVDD benchmark show MARS improves Attack Success Rate (ASR) in in-distribution transfer (13%), out-of-distribution transfer (10%), and cross-task evaluation (36%), highlighting the urgent need for robust SVDD systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper frames a Linearity Trap in SSL-SVDD attacks and offers MARS with bi-level optimization on semantic/artifact anchors, but the abstract leaves the mechanism unverified.

The main takeaway is that existing black-box attacks on SSL-based singing voice deepfake detectors get stuck optimizing surrogate-specific directions instead of suppressing shared spoof cues, and the authors propose MARS to escape via bi-level optimization that pulls toward a natural semantic manifold using anchors from pre-trained and fine-tuned SSL spaces.

What is new is the tailored construction for singing audio: natural semantic anchors from the pre-trained space and artifact anchors from the fine-tuned space, combined with inner tangential exploration and outer guidance. The reported ASR lifts on CtrSVDD (13% in-distribution transfer, 10% out-of-distribution, 36% cross-task) are concrete numbers on an external benchmark.

The framing of the problem as a geometric failure rather than just poor surrogate choice is a useful shift, and the hypothesis-evidence manipulation angle distinguishes it from standard cross-entropy attacks.

The soft spots are in the verification of the core claim. The abstract gives no equations, no embedding diagnostics, and no analysis showing that the resulting perturbations actually land closer to the natural manifold or suppress detector-shared spoof evidence. The stress-test concern lands: without those checks it is possible the gains come from higher effective query budget or surrogate choice rather than the hypothesized escape from the trap.

Experiments are described at high level only, with no baselines, ablations, error bars, or statistical tests mentioned. That makes it difficult to judge whether the method is responsible for the improvements.

This is for people working on adversarial robustness for audio deepfakes, especially transfer attacks against SSL models. A reader focused on singing voice synthesis security could extract the anchor-and-bi-level idea if the full paper supplies the missing implementation details.

It deserves peer review so referees can examine the actual optimization procedure and any manifold-distance measurements. The angle is distinct enough to warrant that step even if the current evidence is thin.

Referee Report

3 major / 1 minor

Summary. The manuscript proposes MARS (Meta-Adversarial Regression of Semantics), a transfer-based black-box attack framework for SSL-based singing voice deepfake detectors. It diagnoses the 'Linearity Trap' as attacks following surrogate-specific gradients aligned with fine-tuned artifact directions rather than suppressing shared spoof evidence. MARS constructs a natural semantic anchor from pre-trained SSL space and an artifact anchor from fine-tuned space, then applies bi-level optimization (inner tangential exploration, outer manifold guidance) to improve transferability, claiming ASR gains of 13% (in-distribution), 10% (out-of-distribution), and 36% (cross-task) on the CtrSVDD benchmark.

Significance. If the central geometric mechanism is validated, the work would be significant for adversarial robustness evaluation in audio deepfake detection, especially for singing voice synthesis threats. It offers a manifold-based alternative to standard surrogate attacks and reports concrete transfer gains that could inform defense design. The explicit use of pre-trained vs. fine-tuned SSL spaces for anchor construction is a potentially reusable idea if supported by diagnostics.

major comments (3)

[Method (bi-level optimization description)] The core hypothesis that pre-trained SSL provides a 'natural semantic anchor' distinct from artifact-sensitive directions in the fine-tuned space (and that this suppresses shared spoof evidence) lacks any embedding-space diagnostics, distance metrics, or visualization in the experiments. This assumption is load-bearing for the claim that MARS escapes the Linearity Trap rather than simply altering surrogate gradients.
[Experiments section] The reported ASR improvements (13/10/36%) are presented without error bars, statistical significance tests, or ablation studies that isolate the bi-level components (inner tangential exploration and outer manifold guidance) from confounds such as total query budget or surrogate selection. Without these, attribution to the proposed geometric escape cannot be verified.
[Algorithmic description of MARS] No convergence analysis, perturbation norm bounds, or content-preservation metrics (e.g., perceptual similarity to original singing audio) are provided for the bi-level optimization, leaving open whether the output lands closer to the natural manifold while remaining a valid adversarial example.

minor comments (1)

[Abstract and introduction] The term 'hypothesis-evidence manipulation' is introduced in the abstract but not formally defined or connected to the anchor construction in the method section.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive comments on our manuscript. We address each major comment point-by-point below, providing clarifications and committing to revisions that strengthen the validation of our claims.

Referee: [Method (bi-level optimization description)] The core hypothesis that pre-trained SSL provides a 'natural semantic anchor' distinct from artifact-sensitive directions in the fine-tuned space (and that this suppresses shared spoof evidence) lacks any embedding-space diagnostics, distance metrics, or visualization in the experiments. This assumption is load-bearing for the claim that MARS escapes the Linearity Trap rather than simply altering surrogate gradients.

Authors: We agree that direct embedding-space diagnostics would strengthen the geometric interpretation. The manuscript uses the observed ASR gains in cross-task transfer (36%) as indirect evidence that the anchors are distinct and that the bi-level optimization suppresses surrogate-specific artifact directions rather than merely altering gradients. In revision, we will add t-SNE visualizations of the pre-trained versus fine-tuned SSL spaces and cosine similarity metrics between the semantic and artifact anchors to provide explicit support for the hypothesis.
revision: yes

Referee: [Experiments section] The reported ASR improvements (13/10/36%) are presented without error bars, statistical significance tests, or ablation studies that isolate the bi-level components (inner tangential exploration and outer manifold guidance) from confounds such as total query budget or surrogate selection. Without these, attribution to the proposed geometric escape cannot be verified.

Authors: The referee correctly notes a gap in experimental rigor. The reported gains are point estimates without measures of variability or component isolation. We will revise the experiments section to include error bars from multiple independent runs, statistical significance tests on the ASR differences, and targeted ablations that vary the inner and outer optimization stages independently while controlling for query budget and surrogate choice.
revision: yes

Referee: [Algorithmic description of MARS] No convergence analysis, perturbation norm bounds, or content-preservation metrics (e.g., perceptual similarity to original singing audio) are provided for the bi-level optimization, leaving open whether the output lands closer to the natural manifold while remaining a valid adversarial example.

Authors: We acknowledge that the current manuscript omits these analyses for the bi-level optimization procedure. The framework is intended to produce perturbations that remain on or near the natural manifold, but without reported diagnostics this property is not explicitly verified. In the revised version, we will include convergence plots for the inner and outer loops, L2 perturbation norm statistics, and content-preservation metrics such as perceptual similarity scores to confirm that adversarial examples preserve singing audio quality. revision: yes
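
The perturbation-norm statistics promised in this last response are straightforward to compute once clean and adversarial waveforms are available. The sketch below assumes both are equal-length lists of float samples and reports the L2 norm, the L∞ norm, and the signal-to-noise ratio in dB. The helper name and sample values are hypothetical, and SNR is offered only as one simple stand-in for a content-preservation metric, not as the metric the authors will report.

```python
import math

def perturbation_stats(clean, adversarial):
    """Return L2 norm, L-infinity norm, and SNR (dB) of the perturbation."""
    if len(clean) != len(adversarial) or not clean:
        raise ValueError("need two non-empty waveforms of equal length")
    delta = [a - c for c, a in zip(clean, adversarial)]
    l2 = math.sqrt(sum(d * d for d in delta))
    linf = max(abs(d) for d in delta)
    signal_power = sum(c * c for c in clean)
    noise_power = sum(d * d for d in delta)
    if noise_power == 0.0:
        snr_db = math.inf       # identical waveforms: no perturbation at all
    elif signal_power == 0.0:
        snr_db = -math.inf      # silent clean signal with a non-zero perturbation
    else:
        snr_db = 10.0 * math.log10(signal_power / noise_power)
    return {"l2": l2, "linf": linf, "snr_db": snr_db}

stats = perturbation_stats([1.0, -1.0, 1.0, -1.0], [1.1, -1.0, 0.9, -1.0])
```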

Circularity Check

0 steps flagged

No circularity; new framework with external benchmark validation

The paper presents MARS as a novel bi-level optimization framework that constructs semantic and artifact anchors from pre-trained and fine-tuned SSL spaces, then applies inner tangential exploration and outer manifold guidance. The central claims rest on algorithmic novelty and reported ASR gains on the external CtrSVDD benchmark rather than any reduction of outputs to fitted inputs, self-definitional constructs, or load-bearing self-citations. No equations or derivations in the provided text reduce predictions to the method's own definitions by construction, and the derivation chain remains self-contained against external evaluation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 3 invented entities

Based solely on the abstract, the central claim rests on the existence of shared spoof evidence across models and on the effectiveness of the proposed anchors and optimization; several new concepts are introduced without independent evidence outside the paper.

axioms (1)

domain assumption: SSL models possess a pre-trained space containing natural semantics separable from a fine-tuned space containing artifact-sensitive directions.
Invoked to justify construction of the two anchors in MARS.

invented entities (3)

natural semantic anchor (no independent evidence)
purpose: Point in pre-trained SSL space used as target for natural semantics
Newly introduced construct for the attack framework

artifact anchor (no independent evidence)
purpose: Point in fine-tuned space used to represent spoof artifacts
Newly introduced construct for the attack framework

Linearity Trap (no independent evidence)
purpose: Geometric explanation for poor transferability of gradient-based attacks
Term coined in the paper to describe the failure mode

pith-pipeline@v0.9.1-grok · 5851 in / 1593 out tokens · 38756 ms · 2026-06-30T18:56:05.287817+00:00 · methodology

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Beyond Waveform Robustness: Robust Feature-Vocoder Adversarial Attacks on Automatic Speech Recognition
cs.SD · 2026-06 · unverdicted · novelty 7.0

Introduces a feature-vocoder adversarial attack on ASR using SSL representations that reports +26.6 WER black-box transfer and +36.2 WER defense resistance over baselines.


2020
[43]

The power spherical distribu- tion,

N. De Cao and W. Aziz, “The power spherical distribu- tion,”arXiv preprint arXiv:2006.04437, 2020. 4

work page arXiv 2006
[44]

V oice activity detection (vad) in noisy environ- ments,

J. Ball, “V oice activity detection (vad) in noisy environ- ments,”arXiv preprint arXiv:2312.05815, 2023. 7

work page arXiv 2023
[45]

Nes2net: A lightweight nested architecture for founda- tion model driven speech anti-spoofing,

T. Liu, D.-T. Truong, R. K. Das, K. A. Lee, and H. Li, “Nes2net: A lightweight nested architecture for founda- tion model driven speech anti-spoofing,”arXiv preprint arXiv:2504.05657, 2025. 8, 13

work page arXiv 2025
[46]

Aasist: Audio anti- spoofing using integrated spectro-temporal graph atten- tion networks,

J.-w. Jung, H.-S. Heo, H. Tak, H.-j. Shim, J. S. Chung, B.-J. Lee, H.-J. Yu, and N. Evans, “Aasist: Audio anti- spoofing using integrated spectro-temporal graph atten- tion networks,” inICASSP 2022-2022 IEEE interna- tional conference on acoustics, speech and signal pro- cessing (ICASSP). IEEE, 2022, pp. 6367–6371. 8

2022
[47]

Audio deepfake detec- tion with self-supervised xls-r and sls classifier,

Q. Zhang, S. Wen, and T. Hu, “Audio deepfake detec- tion with self-supervised xls-r and sls classifier,” inPro- ceedings of the 32nd ACM International Conference on Multimedia, 2024, pp. 6765–6773. 8

2024
[48]

Multi-level ssl feature gating for au- dio deepfake detection,

H. M. Tran, D. Lolive, A. Sini, A. Delhay, P.-F. Marteau, and D. Guennec, “Multi-level ssl feature gating for au- dio deepfake detection,” inProceedings of the 33rd ACM International Conference on Multimedia, 2025, pp. 11 766–11 775. 8

2025
[49]

Delving into Transferable Adversarial Examples and Black-box Attacks

Y . Liu, X. Chen, C. Liu, and D. Song, “Delving into transferable adversarial examples and black-box attacks,” arXiv preprint arXiv:1611.02770, 2016. 8

work page internal anchor Pith review Pith/arXiv arXiv 2016
[50]

Adversarial attacks on audio deepfake detection: A benchmark and comparative study,

K. Uddin, M. U. Farooq, A. Khan, and K. M. Ma- lik, “Adversarial attacks on audio deepfake detection: A benchmark and comparative study,”arXiv preprint arXiv:2509.07132, 2025. 8

work page arXiv 2025
[51]

Boosting adversarial attacks with momentum,

Y . Dong, F. Liao, T. Pang, H. Su, J. Zhu, X. Hu, and J. Li, “Boosting adversarial attacks with momentum,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 9185–9193. 8

2018
[52]

Comprehensive layer-wise analysis of ssl models for audio deepfake detection,

Y . El Kheir, Y . Samih, S. Maharjan, T. Polzehl, and S. Möller, “Comprehensive layer-wise analysis of ssl models for audio deepfake detection,” inFindings of the Association for Computational Linguistics: NAACL 2025, 2025, pp. 4070–4082. 9, 18

2025
[53]

Fsd: An initial chinese dataset for fake song detection,

Y . Xie, J. Zhou, X. Lu, Z. Jiang, Y . Yang, H. Cheng, and L. Ye, “Fsd: An initial chinese dataset for fake song detection,” inICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2024, pp. 4605–4609. 10

2024
[54]

Sonics: Synthetic or not–identifying counterfeit songs,

M. A. Rahman, Z. I. A. Hakim, N. H. Sarker, B. Paul, and S. A. Fattah, “Sonics: Synthetic or not–identifying counterfeit songs,”arXiv preprint arXiv:2408.14080,

work page arXiv
[55]

Udio: The ai music creation platform,

Udio, “Udio: The ai music creation platform,” 2024, accessed: 2026-01-27. [Online]. Available: https://www.udio.com/ 10

2024
[56]

Suno: Make a song with ai,

Suno AI, “Suno: Make a song with ai,” 2024, accessed: 2026-01-27. [Online]. Available: https://suno.com/ 10

2024
[57]

Hybrid trans- formers for music source separation,

S. Rouard, F. Massa, and A. Défossez, “Hybrid trans- formers for music source separation,” inICASSP 23,
[58]

Perceptual evaluation of speech quality (pesq)-a new method for speech quality assessment of telephone networks and codecs,

A. W. Rix, J. G. Beerends, M. P. Hollier, and A. P. Hek- stra, “Perceptual evaluation of speech quality (pesq)-a new method for speech quality assessment of telephone networks and codecs,” in2001 IEEE international con- ference on acoustics, speech, and signal processing. Pro- ceedings (Cat. No. 01CH37221), vol. 2. IEEE, 2001, pp. 749–752. 12

2001
[59]

An algorithm for intelligibility prediction of time– frequency weighted noisy speech,

C. H. Taal, R. C. Hendriks, R. Heusdens, and J. Jensen, “An algorithm for intelligibility prediction of time– frequency weighted noisy speech,”IEEE Transactions on audio, speech, and language processing, vol. 19, no. 7, pp. 2125–2136, 2011. 12

2011
[60]

Sdr–half-baked or well done?

J. Le Roux, S. Wisdom, H. Erdogan, and J. R. Hershey, “Sdr–half-baked or well done?” inICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 626–630. 12

2019
[61]

Spgm: Prioritizing local features for enhanced speech separa- tion performance,

J. Q. Yip, S. Zhao, Y . Ma, C. Ni, C. Zhang, H. Wang, T. H. Nguyen, K. Zhou, D. Ng, E. S. Chnget al., “Spgm: Prioritizing local features for enhanced speech separa- tion performance,” inICASSP 2024-2024 IEEE Inter- national Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2024, pp. 326–330. 12

2024
[62]

Fireredasr: Open-source industrial-grade mandarin speech recogni- tion models from encoder-decoder to llm integration,

K.-T. Xu, F.-L. Xie, X. Tang, and Y . Hu, “Fireredasr: Open-source industrial-grade mandarin speech recogni- tion models from encoder-decoder to llm integration,” arXiv preprint arXiv:2501.14350, 2025. 13 16

work page arXiv 2025
[63]

Songbsab: A dual prevention approach against singing voice conversion based illegal song covers,

G. Chen, Y . Zhang, F. Song, T. Wang, X. Du, and Y . Liu, “Songbsab: A dual prevention approach against singing voice conversion based illegal song covers,” arXiv preprint arXiv:2401.17133, 2024. 13

work page arXiv 2024
[64]

I can hear you: Selec- tive robust training for deepfake audio detection,

Z. Zhang, W. Hao, A. Sankoh, W. Lin, E. Mendiola- Ortiz, J. Yang, and C. Mao, “I can hear you: Selec- tive robust training for deepfake audio detection,”arXiv preprint arXiv:2411.00121, 2024. 13

work page arXiv 2024
A Method

A.1 Deepfake Effectiveness Performance

Table 4: EER (%) of different SLS deepfake detectors (mid-layer) on the CtrSVDD dataset.

SSL Models | EER (%)
Wav2vec-B | 8.39
Wav2vec-X | ...
