Advancing Zero-Shot Open-Set Speech Deepfake Source Tracing
Pith reviewed 2026-05-18 12:41 UTC · model grok-4.3
The pith
Zero-shot cosine scoring outperforms few-shot methods for out-of-distribution deepfake source tracing while few-shot leads on in-distribution trials.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The adapted SSL-AASIST embeddings support open-set attack source verification, with few-shot Siamese and MLP reaching EERs of 17.72% and 13.11% on ID trials compared to 29.91% for zero-shot cosine scoring, while zero-shot cosine scoring reaches 16.43% EER on OOD trials, outperforming few-shot Siamese at 23.47% and MLP at 21.57%.
What carries the argument
Adapted SSL-AASIST embeddings enhanced with AAM loss and RegMixup for attack classification, paired with zero-shot cosine or few-shot Siamese and MLP scoring backends for verification.
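As a rough illustration of the zero-shot cosine backend named above, the sketch below averages a few enrollment embeddings of one attack into a fingerprint and scores a trial embedding against it. The mean pooling, the 0.5 threshold, the function names, and the toy 3-dimensional vectors are assumptions made for illustration; the paper's embeddings come from the adapted SSL-AASIST model, and its exact scoring details are not reproduced here.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def fingerprint(embeddings):
    """Average enrollment embeddings into one fingerprint (assumed pooling)."""
    dim = len(embeddings[0])
    return [sum(e[i] for e in embeddings) / len(embeddings) for i in range(dim)]


def verify(fingerprint_vec, trial_vec, threshold=0.5):
    """Accept the claimed attack source when the cosine score clears the threshold."""
    score = cosine(fingerprint_vec, trial_vec)
    return score, score >= threshold


# Toy embeddings (hypothetical values, not taken from the paper).
enroll = [[1.0, 0.0, 0.0], [0.8, 0.2, 0.0]]
fp = fingerprint(enroll)
same_score, same_ok = verify(fp, [0.9, 0.1, 0.0])
diff_score, diff_ok = verify(fp, [0.0, 0.0, 1.0])
```

No backend training is involved here, which is why this kind of scoring can be applied to an attack that was never seen during training.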
If this is right
- Few-shot backends should be selected for source tracing when test attacks match the training distribution.
- Zero-shot cosine scoring is preferable when encountering entirely new attack types.
- Maintaining attack disjointness during training is necessary to validate generalization in open-set conditions.
- Hybrid systems could switch between zero-shot and few-shot scoring depending on observed distribution shift.
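The hybrid idea in the last bullet could be sketched as follows. This is a hypothetical selection rule, not something the paper describes: a trial is treated as in-distribution when its embedding is close to the centroid of any known attack, and as out-of-distribution otherwise. The name `choose_backend`, the 0.7 threshold, and the toy centroids are assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def choose_backend(trial_vec, known_centroids, shift_threshold=0.7):
    """Pick 'few-shot' when the trial resembles a known attack, else 'zero-shot'."""
    best = max(cosine(trial_vec, c) for c in known_centroids.values())
    return "few-shot" if best >= shift_threshold else "zero-shot"


# Hypothetical centroids of two known (in-distribution) attacks.
known = {"attack_A": [1.0, 0.0, 0.0], "attack_B": [0.0, 1.0, 0.0]}
near_known = choose_backend([0.9, 0.1, 0.0], known)
far_from_known = choose_backend([0.1, 0.1, 1.0], known)
```

In practice the threshold would have to be tuned on held-out data, and a poor shift detector could cancel the benefit of switching.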
Where Pith is reading between the lines
- Real-world deployment may require automatic detection of whether a new deepfake belongs to the known or unknown attack distribution.
- The approach could extend to attributing deepfakes to specific generation tools beyond the current dataset.
- Combining embedding adaptation with distribution-aware backend selection might improve robustness across evolving attack landscapes.
Load-bearing premise
Training attacks can be kept completely disjoint from fingerprint-trial pairs while the embeddings still generalize to trace unseen attack sources.
What would settle it
Zero-shot cosine scoring failing to achieve lower EER than few-shot methods in OOD trials on an independent dataset with new disjoint attacks.
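Such a test comes down to comparing EERs computed from target and non-target trial scores. A minimal sketch of an EER estimate is given below, assuming higher scores mean "same source"; it sweeps each observed score as a threshold and reports the point where false acceptance and false rejection rates are closest. The function name and the toy scores are illustrative, and evaluation toolkits typically interpolate the crossing point rather than use this coarse sweep.

```python
def compute_eer(target_scores, nontarget_scores):
    """Approximate the equal error rate by sweeping observed scores as thresholds."""
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best_gap, eer = float("inf"), 1.0
    for t in thresholds:
        # False rejection: target trials scored below the threshold.
        frr = sum(s < t for s in target_scores) / len(target_scores)
        # False acceptance: non-target trials scored at or above the threshold.
        far = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer


# Toy scores (hypothetical): one target trial scores below one non-target trial.
overlapping = compute_eer([0.9, 0.8, 0.7, 0.4], [0.6, 0.3, 0.2, 0.1])
separable = compute_eer([0.9, 0.8], [0.2, 0.1])
```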
Original abstract
We propose a novel zero-shot source tracing framework inspired by speaker verification. We adapt SSL-AASIST for attack classification, enhancing embeddings with AAM loss and RegMixup, and ensure that training attacks are disjoint from those forming fingerprint-trial pairs. For backend scoring in attack verification, we explore both zero-shot approaches (cosine similarity and Siamese) and few-shot approaches (MLP and Siamese). Experiments on our recently introduced STOPA dataset with an open set setting show that few-shot learning provides advantages in the in-distribution (ID) scenario, while zero-shot approaches perform better in the out-of-distribution (OOD) scenario. In attack source verification with ID trials, few-shot Siamese and MLP achieve equal error rates (EER) of 17.72% and 13.11%, compared to 29.91% for zero-shot cosine scoring. Conversely, in OOD trials, zero-shot cosine scoring reaches 16.43%, outperforming few-shot Siamese at 23.47% and MLP at 21.57%.
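For readers unfamiliar with AAM loss, the sketch below shows the general additive angular margin idea for a single training sample: a margin is added to the angle between the embedding and its target-class weight before the scaled softmax cross-entropy is computed. The margin of 0.2, the scale of 10, and the function name are assumptions; the paper's hyperparameters are not given on this page, and practical implementations also handle the case where the angle plus margin exceeds pi, which this sketch omits.

```python
import math


def aam_softmax_loss(cosines, target, margin=0.2, scale=10.0):
    """Cross-entropy with an additive angular margin on the target class.

    cosines[j] is cos(theta_j) between the embedding and class weight j.
    """
    logits = []
    for j, c in enumerate(cosines):
        c = max(-1.0, min(1.0, c))  # keep acos in its valid domain
        if j == target:
            c = math.cos(math.acos(c) + margin)  # penalize the target angle
        logits.append(scale * c)
    # Numerically stable log-sum-exp.
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum - logits[target]


# Toy cosines for a 3-class problem (hypothetical values).
cosines = [0.8, 0.1, -0.2]
loss_with_margin = aam_softmax_loss(cosines, target=0, margin=0.2)
loss_without_margin = aam_softmax_loss(cosines, target=0, margin=0.0)
```

Because the margin makes the target class harder to satisfy, the loss with margin is larger for the same embedding, which pushes classes further apart in angle during training.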
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes a zero-shot open-set source tracing framework for speech deepfakes, adapting SSL-AASIST embeddings with AAM loss and RegMixup while keeping training attacks disjoint from fingerprint-trial pairs. It evaluates zero-shot (cosine similarity, Siamese) and few-shot (MLP, Siamese) backends for attack verification on the STOPA dataset under open-set conditions, claiming that few-shot methods yield lower EER in ID trials (MLP at 13.11%, Siamese at 17.72%) while zero-shot cosine scoring outperforms in OOD trials (16.43% vs. 21.57-23.47%).
Significance. If the strict disjointness and absence of leakage hold, the work offers a useful empirical comparison showing that backend choice should depend on whether the scenario is ID or OOD, advancing forensic tools for deepfake attribution beyond closed-set assumptions.
major comments (1)
- [Experimental setup / attack selection] Experimental setup (abstract and methods description of attack selection): The central open-set claim rests on the assertion that training attacks are kept completely disjoint from those forming fingerprint-trial pairs. However, no explicit validation, overlap analysis, or checks against indirect exposure via SSL pretraining data or shared synthesis artifacts in the adaptation stage are provided. This directly affects whether the reported EER gaps (e.g., few-shot 13.11% ID vs. zero-shot 16.43% OOD) demonstrate true generalization rather than partial leakage.
minor comments (2)
- [Results] Results section: The reported EER values lack error bars, confidence intervals, or statistical significance tests, making it difficult to assess the robustness of the ID/OOD performance differences.
- [Methods] Methods: No ablation studies are described that isolate the contributions of AAM loss versus RegMixup to the embedding quality or downstream verification performance.
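One way the first minor comment could be addressed is a percentile bootstrap over trials, sketched below. This is an illustrative assumption, not a procedure from the paper: the function names, 200 resamples, and the toy scores are hypothetical, and resampling individual trials ignores correlations between trials that share a fingerprint.

```python
import random


def compute_eer(target_scores, nontarget_scores):
    """Approximate EER by sweeping observed scores as thresholds."""
    best_gap, eer = float("inf"), 1.0
    for t in sorted(set(target_scores) | set(nontarget_scores)):
        frr = sum(s < t for s in target_scores) / len(target_scores)
        far = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return eer


def bootstrap_eer_ci(target, nontarget, n_boot=200, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the EER."""
    rng = random.Random(seed)
    eers = []
    for _ in range(n_boot):
        # Resample trials with replacement, keeping the original set sizes.
        t = [rng.choice(target) for _ in target]
        n = [rng.choice(nontarget) for _ in nontarget]
        eers.append(compute_eer(t, n))
    eers.sort()
    lo = eers[int((alpha / 2) * n_boot)]
    hi = eers[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi


# Toy scores (hypothetical values).
target = [0.9, 0.8, 0.7, 0.6, 0.4, 0.85, 0.75, 0.3]
nontarget = [0.6, 0.3, 0.2, 0.1, 0.5, 0.25, 0.15, 0.45]
ci_low, ci_high = bootstrap_eer_ci(target, nontarget)
```

Non-overlapping intervals for two backends on the same trial list would give some support to the reported ID/OOD differences.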
Simulated Author's Rebuttal
We thank the referee for their thorough review and constructive feedback. We address the concern about validating the disjointness of training attacks in the experimental setup below, and we will incorporate clarifications to strengthen the open-set claims.
- Referee: [Experimental setup / attack selection] Experimental setup (abstract and methods description of attack selection): The central open-set claim rests on the assertion that training attacks are kept completely disjoint from those forming fingerprint-trial pairs. However, no explicit validation, overlap analysis, or checks against indirect exposure via SSL pretraining data or shared synthesis artifacts in the adaptation stage are provided. This directly affects whether the reported EER gaps (e.g., few-shot 13.11% ID vs. zero-shot 16.43% OOD) demonstrate true generalization rather than partial leakage.
Authors: We agree that explicit documentation of the attack disjointness is essential to substantiate the open-set evaluation. In the revised manuscript, we will expand the Methods section with a dedicated subsection on attack selection. This will include: (1) an explicit enumeration of the specific deepfake generation methods (e.g., by name or reference to STOPA categories) assigned to the training set versus those reserved exclusively for constructing fingerprint-trial pairs in both ID and OOD partitions; (2) confirmation that the partitions were constructed to ensure zero overlap at the attack-instance level. Regarding indirect exposure, the SSL-AASIST backbone was pre-trained on ASVspoof 2019 LA, whose synthesis algorithms (e.g., conventional TTS/VC) differ from the modern neural vocoders and diffusion-based methods in STOPA; we will add a short paragraph stating this distinction and noting that no STOPA attacks appear in the pre-training corpus. For shared synthesis artifacts during adaptation, the combination of AAM loss and RegMixup encourages the embeddings to capture attack-specific discriminative cues rather than generic artifacts; we will report a supplementary cosine-similarity analysis between training and held-out attack embeddings to quantify any residual overlap. These additions will directly address the leakage concern while preserving the reported EER comparisons.
Revision: yes
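The two checks promised in the rebuttal, label-level disjointness and a cosine-similarity analysis between training and held-out attack embeddings, could look roughly like the sketch below. The attack labels, the toy 2-dimensional embeddings, and the function names are hypothetical; comparing per-attack centroids is one assumed way to carry out such an analysis, not the authors' stated procedure.

```python
import math


def assert_disjoint(train_attacks, trial_attacks):
    """Raise if any attack label appears in both partitions."""
    overlap = set(train_attacks) & set(trial_attacks)
    if overlap:
        raise ValueError(f"attack overlap: {sorted(overlap)}")


def centroid(vectors):
    """Mean embedding of one attack."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def cross_similarity(train_embs, heldout_embs):
    """Cosine similarity between every training and held-out attack centroid."""
    return {
        (t, h): cosine(centroid(tv), centroid(hv))
        for t, tv in train_embs.items()
        for h, hv in heldout_embs.items()
    }


# Hypothetical attack labels with toy embeddings.
train = {"T1": [[1.0, 0.0], [0.9, 0.1]]}
heldout = {"H1": [[0.0, 1.0], [0.1, 0.9]], "H2": [[1.0, 0.1], [1.0, -0.1]]}
assert_disjoint(train, heldout)  # passes: the label sets do not intersect
sims = cross_similarity(train, heldout)
```

A held-out attack whose centroid sits very close to a training centroid (like H2 here) would be a candidate for the indirect leakage the referee raises, even though its label is disjoint.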
Circularity Check
No circularity: purely empirical comparison on held-out disjoint data
The manuscript reports experimental EER results from adapting SSL-AASIST embeddings (with AAM loss and RegMixup) and comparing zero-shot versus few-shot backends on the STOPA dataset under an explicitly stated disjoint training-attack condition for ID and OOD trials. No equations, predictions, or first-principles derivations are present that reduce by construction to fitted parameters or self-citations; the central claims are direct outcome measurements on held-out pairs, rendering the work self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: SSL-AASIST embeddings can be enhanced for attack classification via AAM loss and RegMixup while preserving generalization to unseen attacks.
- domain assumption: Training attacks remain completely disjoint from fingerprint-trial pairs in the open-set evaluation.
Reference graph
Works this paper leans on
- [1] Advancing Zero-Shot Open-Set Speech Deepfake Source Tracing, arXiv, 2025. Introduction: “Trust, once lost, is not easily regained.” Advances in neural speech synthesis and voice conversion now enable the creation of highly realistic spoofed speech [1]. Such speech is often indistinguishable from bonafide human speech, both for listeners and for automatic systems [2]. The research community has responded with increasingly pow...
- [2] Open-Set Attack Source Verification. 2.1. Attack Source Verification. Source tracing aims to identify or verify the source of a spoofing attack given an unknown utterance $x$. In an identification setting, a system $F_I$ predicts the source as $\hat{k} = \arg\max_{k \in \mathcal{A}_{\mathrm{ID}}} F_I(x)_k$ (1), where $\mathcal{A}_{\mathrm{ID}}$ denotes the set of in-distribution (seen or known) attacks. In the closed-set case, th...
- [3] Experimental Setup, Results and Discussion. 3.1. Database. We conduct experiments primarily on the recent, publicly available STOPA [14] dataset. It contains 699k spoofed utterances from 13 attack systems, formed by combining 8 acoustic models (AMs) and 6 vocoder models (VMs). Each utterance is labeled with its attack id as well as AM and VM ids, enabling multi-l...
- [4] Conclusion. We addressed a realistic and challenging open-set, zero-shot source tracing scenario. Specifically, we enhanced SSL-AASIST embeddings with AAM loss and incorporated out-of-domain data to improve variability and robustness. In zero-shot tracing, cosine similarity generalized best to unseen attacks, while few-shot backends (MLP, Siamese) prov...
- [5] Jiangyan Yi, Chenglong Wang, Jianhua Tao, Xiaohui Zhang, Chu Yuan Zhang, and Yan Zhao, “Audio deepfake detection: A survey,” arXiv preprint arXiv:2308.14970, 2023.
- [6] Menglu Li, Yasaman Ahmadiadli, and Xiao-Ping Zhang, “A survey on speech deepfake detection,” ACM Computing Surveys, vol. 57, no. 7, pp. 1–38, 2025.
- [7] Nicholas Klein, Tianxiang Chen, Hemlata Tak, Ricardo Casal, and Elie Khoury, “Source tracing of audio deepfake systems,” arXiv preprint arXiv:2407.08016, 2024.
- [8] Tinglong Zhu, Xingming Wang, Xiaoyi Qin, and Ming Li, “Source tracing: detecting voice spoofing,” in 2022 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC). IEEE, 2022, pp. 216–220.
- [9] Pierre Falez, Tony Marteau, Damien Lolive, and Arnaud Delhay, “Audio deepfake source tracing using multi-attribute open-set identification and verification,” in Proc. Interspeech 2025, 2025, pp. 1528–1532.
- [10] Ajinkya Kulkarni, Sandipana Dowerah, Tanel Alumäe, and Mathew Magimai Doss, “Unveiling Audio Deepfake Origins: A Deep Metric learning And Conformer Network Approach With Ensemble Fusion,” in Interspeech 2025, 2025, pp. 1533–1537.
- [11] Adriana Stan, David Combei, Dan Oneata, and Horia Cucu, “TADA: Training-free Attribution and Out-of-Domain Detection of Audio Deepfakes,” in Interspeech 2025, 2025, pp. 1543–1547.
- [12] Nicholas Klein, Hemlata Tak, and Elie Khoury, “Open-Set Source Tracing of Audio Deepfake Systems,” in Interspeech 2025, 2025, pp. 1578–1582.
- [13] Yang Xiao and Rohan Kumar Das, “Listen, Analyze, and Adapt to Learn New Attacks: An Exemplar-Free Class Incremental Learning Method for Audio Deepfake Source Tracing,” in Interspeech 2025, 2025, pp. 1563–1567.
- [14] Thien-Phuc Doan, Kihun Hong, and Souhwan Jung, “VIB-based Real Pre-emphasis Audio Deepfake Source Tracing,” in Interspeech 2025, 2025, pp. 1568–1572.
- [15] Dimitrios Koutsianos, Stavros Zacharopoulos, Yannis Panagakis, and Themos Stafylakis, “Synthetic Speech Source Tracing using Metric Learning,” in Interspeech 2025, 2025, pp. 1558–1562.
- [16] Viola Negroni, Davide Salvi, Paolo Bestagini, and Stefano Tubaro, “Source Verification for Speech Deepfakes,” in Interspeech 2025, 2025, pp. 1548–1552.
- [17] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
- [18] Anton Firc, Manasi Chhibber, Jagabandhu Mishra, Vishwanath Pratap Singh, Tomi Kinnunen, and Kamil Malinka, “STOPA: A Dataset of Systematic VariaTion Of DeePfake Audio for Open-Set Source Tracing and Attribution,” in Interspeech 2025, 2025, pp. 1553–1557.
- [19] Xin Wang and Junichi Yamagishi, “Investigating self-supervised front ends for speech spoofing countermeasures,” arXiv preprint arXiv:2111.07725, 2021.
- [20] Jiankang Deng, Jia Guo, Niannan Xue, and Stefanos Zafeiriou, “ArcFace: Additive angular margin loss for deep face recognition,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 4690–4699.
- [21] Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick, “Momentum contrast for unsupervised visual representation learning,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 9729–9738.
- [22] Xin Wang, Junichi Yamagishi, Massimiliano Todisco, Héctor Delgado, Andreas Nautsch, Nicholas Evans, Md Sahidullah, Ville Vestman, Tomi Kinnunen, Kong Aik Lee, et al., “ASVspoof 2019: A large-scale public database of synthesized, converted and replayed speech,” Computer Speech & Language, vol. 64, pp. 101114, 2020.
- [23] Hemlata Tak, Massimiliano Todisco, Xin Wang, Jee-weon Jung, Junichi Yamagishi, and Nicholas Evans, “Automatic speaker verification spoofing and deepfake detection using wav2vec 2.0 and data augmentation,” arXiv preprint arXiv:2202.12233, 2022.