
arxiv: 2509.24674 · v2 · submitted 2025-09-29 · 📡 eess.AS

Advancing Zero-Shot Open-Set Speech Deepfake Source Tracing

Pith reviewed 2026-05-18 12:41 UTC · model grok-4.3

classification 📡 eess.AS
keywords deepfake source tracing · zero-shot learning · open-set recognition · speech deepfakes · attack verification · AASIST embeddings · STOPA dataset · equal error rate

The pith

Zero-shot cosine scoring outperforms few-shot methods for out-of-distribution deepfake source tracing while few-shot leads on in-distribution trials.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper develops a framework for tracing the source of speech deepfakes even when the attack type is unseen during training. It adapts SSL-AASIST embeddings using angular additive margin loss and RegMixup while enforcing complete separation between training attacks and those used to form fingerprint-trial pairs. Experiments on the STOPA dataset in an open-set setting demonstrate that few-shot Siamese and MLP backends achieve lower equal error rates on in-distribution trials, but zero-shot cosine similarity performs better on out-of-distribution trials.
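For readers unfamiliar with the angular additive margin (AAM) loss named above, the following minimal Python sketch shows the core idea as defined in the ArcFace paper listed in the reference graph: the angle between an embedding and its true-class weight vector is increased by a margin before scaling and softmax. The function names, the margin of 0.2, the scale of 30, and the toy cosines are illustrative assumptions, not values taken from the paper, and RegMixup is not shown.

```python
import math

def aam_logits(cosines, target, margin=0.2, scale=30.0):
    """Add an angular margin to the target-class angle, then scale all cosines."""
    out = []
    for j, c in enumerate(cosines):
        c = max(-1.0, min(1.0, c))  # keep acos in its valid domain
        if j == target:
            # Simplification: no special handling when angle + margin exceeds pi.
            c = math.cos(math.acos(c) + margin)
        out.append(scale * c)
    return out

def cross_entropy(logits, target):
    """Numerically stable softmax cross-entropy for a single example."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[target]

# Hypothetical cosines between one embedding and three attack-class weight vectors.
cosines = [0.8, 0.1, -0.2]
plain_loss = cross_entropy([30.0 * c for c in cosines], target=0)
margin_loss = cross_entropy(aam_logits(cosines, target=0), target=0)
print(plain_loss, margin_loss)
```

The margin lowers the true-class logit, so the same embedding incurs a larger loss and must move closer to its class direction to compensate. That pressure toward compact, well-separated classes is what makes cosine comparison of embeddings plausible at test time.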

Core claim

The adapted SSL-AASIST embeddings support open-set attack source verification, with few-shot Siamese and MLP reaching EERs of 17.72% and 13.11% on ID trials compared to 29.91% for zero-shot cosine scoring, while zero-shot cosine scoring reaches 16.43% EER on OOD trials, outperforming few-shot Siamese at 23.47% and MLP at 21.57%.
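All figures in this claim are equal error rates. As a reminder of what that metric measures, here is a minimal sketch that sweeps every observed score as a decision threshold and returns the point where the miss rate and the false-alarm rate are closest. The score lists are made-up toy values, and this simple sweep is an assumption for illustration rather than the scoring tool used in the paper.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Return (eer, threshold) by sweeping every observed score as a threshold."""
    best = None
    for t in sorted(set(target_scores) | set(nontarget_scores)):
        # Miss: a same-attack trial scored below the threshold.
        fnr = sum(s < t for s in target_scores) / len(target_scores)
        # False alarm: a different-attack trial scored at or above the threshold.
        fpr = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(fnr - fpr)
        if best is None or gap < best[0]:
            best = (gap, (fnr + fpr) / 2, t)
    return best[1], best[2]

# Toy scores: higher means "trial matches the fingerprint's attack".
target = [0.9, 0.8, 0.7, 0.4]
nontarget = [0.6, 0.3, 0.2, 0.1]
eer, threshold = equal_error_rate(target, nontarget)
print(eer, threshold)  # 0.25 0.6
```

With one of four target trials missed and one of four non-target trials accepted at threshold 0.6, both error rates equal 25%, which is the EER for this toy set.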

What carries the argument

Adapted SSL-AASIST embeddings enhanced with AAM loss and RegMixup for attack classification, paired with zero-shot cosine or few-shot Siamese and MLP scoring backends for verification.
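The zero-shot backend can be pictured with a short sketch. It assumes, by analogy with speaker verification enrollment, that a fingerprint is the mean of length-normalized embeddings from a few utterances of one attack; that averaging step, the function names, and the three-dimensional toy vectors are assumptions for illustration and are not confirmed by the text above.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def fingerprint(embeddings):
    """Average length-normalized enrollment embeddings into one attack fingerprint."""
    normed = [l2_normalize(e) for e in embeddings]
    dim = len(normed[0])
    return [sum(e[i] for e in normed) / len(normed) for i in range(dim)]

def cosine_score(fp, trial):
    """Cosine similarity between a fingerprint and a trial embedding."""
    a, b = l2_normalize(fp), l2_normalize(trial)
    return sum(x * y for x, y in zip(a, b))

# Hypothetical embeddings; real ones would come from the adapted encoder.
enroll = [[1.0, 0.0, 0.1], [0.9, 0.1, 0.0]]
fp = fingerprint(enroll)
same_attack = cosine_score(fp, [1.0, 0.05, 0.05])
other_attack = cosine_score(fp, [0.0, 1.0, 0.0])
print(same_attack, other_attack)
```

Nothing here is trained on the evaluation attacks, which is why this backend counts as zero-shot: the only learned component is the encoder that produced the embeddings.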

If this is right

  • Few-shot backends should be selected for source tracing when test attacks match the training distribution.
  • Zero-shot cosine scoring is preferable when encountering entirely new attack types.
  • Maintaining attack disjointness during training is necessary to validate generalization in open-set conditions.
  • Hybrid systems could switch between zero-shot and few-shot scoring depending on observed distribution shift.
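The hybrid idea in the last bullet can be sketched as a simple gate. This is a speculative illustration, not something the paper implements: it assumes distribution shift can be approximated by the trial's highest cosine similarity to centroids of known attacks, and the threshold of 0.7, the function names, and the stand-in few-shot scorer are all hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def hybrid_score(fp, trial, known_centroids, few_shot_fn, ood_threshold=0.7):
    """Return (backend_name, score) for one fingerprint-trial pair."""
    # Proxy for distribution shift: how close is the trial to any known attack?
    closest = max(cosine(trial, c) for c in known_centroids)
    if closest >= ood_threshold:
        return "few_shot", few_shot_fn(fp, trial)
    return "zero_shot_cosine", cosine(fp, trial)

def toy_few_shot(fp, trial):
    """Stand-in for a trained MLP or Siamese backend (hypothetical)."""
    return 0.5 * (1.0 + cosine(fp, trial))

known = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
id_result = hybrid_score([1.0, 0.1, 0.0], [0.9, 0.1, 0.0], known, toy_few_shot)
ood_result = hybrid_score([0.0, 0.1, 1.0], [0.1, 0.0, 0.9], known, toy_few_shot)
print(id_result[0], ood_result[0])  # few_shot zero_shot_cosine
```

Scores from the two backends live on different scales, so a real system would need to calibrate them before applying a shared decision threshold.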

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Real-world deployment may require automatic detection of whether a new deepfake belongs to the known or unknown attack distribution.
  • The approach could extend to attributing deepfakes to specific generation tools beyond the current dataset.
  • Combining embedding adaptation with distribution-aware backend selection might improve robustness across evolving attack landscapes.

Load-bearing premise

Training attacks can be kept completely disjoint from fingerprint-trial pairs while the embeddings still generalize to trace unseen attack sources.
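The label-level half of this premise is cheap to verify. The sketch below checks that no attack identifier appears both in the embedding training set and in the fingerprint-trial pairs; the identifiers and the function name are hypothetical, and the check says nothing about indirect leakage through shared synthesis components.

```python
def check_attack_disjointness(train_attacks, eval_attacks):
    """Return the sorted attack IDs present on both sides (empty list means disjoint)."""
    return sorted(set(train_attacks) & set(eval_attacks))

# Hypothetical attack IDs, for illustration only.
train = ["A01", "A02", "A03"]    # attacks used to adapt the embeddings
eval_pairs = ["A04", "A05"]      # attacks used in fingerprint-trial pairs
leaked = check_attack_disjointness(train, eval_pairs)
print(leaked)  # []
```

Because STOPA, per the excerpt in the reference graph, labels each utterance with acoustic-model and vocoder IDs as well as an attack ID, the same function could be run on those component IDs to see whether two nominally different attacks share a building block.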

What would settle it

Zero-shot cosine scoring failing to achieve lower EER than few-shot methods in OOD trials on an independent dataset with new disjoint attacks.

Original abstract

We propose a novel zero-shot source tracing framework inspired by speaker verification. We adapt SSL-AASIST for attack classification, enhancing embeddings with AAM loss and RegMixup, and ensure that training attacks are disjoint from those forming fingerprint-trial pairs. For backend scoring in attack verification, we explore both zero-shot approaches (cosine similarity and Siamese) and few-shot approaches (MLP and Siamese). Experiments on our recently introduced STOPA dataset with an open set setting show that few-shot learning provides advantages in the in-distribution (ID) scenario, while zero-shot approaches perform better in the out-of-distribution (OOD) scenario. In attack source verification with ID trials, few-shot Siamese and MLP achieve equal error rates (EER) of 17.72% and 13.11%, compared to 29.91% for zero-shot cosine scoring. Conversely, in OOD trials, zero-shot cosine scoring reaches 16.43%, outperforming few-shot Siamese at 23.47% and MLP at 21.57%.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper proposes a zero-shot open-set source tracing framework for speech deepfakes, adapting SSL-AASIST embeddings with AAM loss and RegMixup while keeping training attacks disjoint from fingerprint-trial pairs. It evaluates zero-shot (cosine similarity, Siamese) and few-shot (MLP, Siamese) backends for attack verification on the STOPA dataset under open-set conditions, claiming that few-shot methods yield lower EER in ID trials (MLP at 13.11%, Siamese at 17.72%) while zero-shot cosine scoring outperforms in OOD trials (16.43% vs. 21.57-23.47%).

Significance. If the strict disjointness and absence of leakage hold, the work offers a useful empirical comparison showing that backend choice should depend on whether the scenario is ID or OOD, advancing forensic tools for deepfake attribution beyond closed-set assumptions.

major comments (1)
  1. [Experimental setup / attack selection] Experimental setup (abstract and methods description of attack selection): The central open-set claim rests on the assertion that training attacks are kept completely disjoint from those forming fingerprint-trial pairs. However, no explicit validation, overlap analysis, or checks against indirect exposure via SSL pretraining data or shared synthesis artifacts in the adaptation stage are provided. This directly affects whether the reported EER gaps (e.g., few-shot 13.11% ID vs. zero-shot 16.43% OOD) demonstrate true generalization rather than partial leakage.
minor comments (2)
  1. [Results] Results section: The reported EER values lack error bars, confidence intervals, or statistical significance tests, making it difficult to assess the robustness of the ID/OOD performance differences.
  2. [Methods] Methods: No ablation studies are described that isolate the contributions of AAM loss versus RegMixup to the embedding quality or downstream verification performance.
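The first minor comment asks for uncertainty estimates on the EERs. One common way to obtain them is a percentile bootstrap over trials, sketched below on synthetic scores. The helper names, the 200 resamples, and the Gaussian toy scores are assumptions; resampling trials independently also ignores the fact that real trials share fingerprints, so an interval computed this way would likely be too narrow.

```python
import random

def eer(target, nontarget):
    """EER via a sweep over all observed scores used as thresholds."""
    best_gap, best_eer = 2.0, 1.0
    for t in sorted(set(target) | set(nontarget)):
        fnr = sum(s < t for s in target) / len(target)
        fpr = sum(s >= t for s in nontarget) / len(nontarget)
        if abs(fnr - fpr) < best_gap:
            best_gap, best_eer = abs(fnr - fpr), (fnr + fpr) / 2
    return best_eer

def bootstrap_eer_ci(target, nontarget, n_boot=200, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the EER, resampling trials with replacement."""
    rng = random.Random(seed)
    stats = []
    for _ in range(n_boot):
        t = rng.choices(target, k=len(target))
        n = rng.choices(nontarget, k=len(nontarget))
        stats.append(eer(t, n))
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper

# Synthetic scores standing in for same-attack and different-attack trials.
gen = random.Random(1)
target = [gen.gauss(1.0, 1.0) for _ in range(60)]
nontarget = [gen.gauss(-1.0, 1.0) for _ in range(60)]
low, high = bootstrap_eer_ci(target, nontarget)
print(eer(target, nontarget), low, high)
```

If the intervals for two backends overlap heavily, the ranking between them should be read with caution.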

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their thorough review and constructive feedback. We address the concern about validating the disjointness of training attacks in the experimental setup below, and we will incorporate clarifications to strengthen the open-set claims.

Point-by-point responses
  1. Referee: [Experimental setup / attack selection] Experimental setup (abstract and methods description of attack selection): The central open-set claim rests on the assertion that training attacks are kept completely disjoint from those forming fingerprint-trial pairs. However, no explicit validation, overlap analysis, or checks against indirect exposure via SSL pretraining data or shared synthesis artifacts in the adaptation stage are provided. This directly affects whether the reported EER gaps (e.g., few-shot 13.11% ID vs. zero-shot 16.43% OOD) demonstrate true generalization rather than partial leakage.

    Authors: We agree that explicit documentation of the attack disjointness is essential to substantiate the open-set evaluation. In the revised manuscript, we will expand the Methods section with a dedicated subsection on attack selection. This will include: (1) an explicit enumeration of the specific deepfake generation methods (e.g., by name or reference to STOPA categories) assigned to the training set versus those reserved exclusively for constructing fingerprint-trial pairs in both ID and OOD partitions; (2) confirmation that the partitions were constructed to ensure zero overlap at the attack-instance level. Regarding indirect exposure, the SSL-AASIST backbone was pre-trained on ASVspoof 2019 LA, whose synthesis algorithms (e.g., conventional TTS/VC) differ from the modern neural vocoders and diffusion-based methods in STOPA; we will add a short paragraph stating this distinction and noting that no STOPA attacks appear in the pre-training corpus. For shared synthesis artifacts during adaptation, the combination of AAM loss and RegMixup encourages the embeddings to capture attack-specific discriminative cues rather than generic artifacts; we will report a supplementary cosine-similarity analysis between training and held-out attack embeddings to quantify any residual overlap. These additions will directly address the leakage concern while preserving the reported EER comparisons.

    revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical comparison on held-out disjoint data

The manuscript reports experimental EER results from adapting SSL-AASIST embeddings (with AAM loss and RegMixup) and comparing zero-shot versus few-shot backends on the STOPA dataset under an explicitly stated disjoint training-attack condition for ID and OOD trials. No equations, predictions, or first-principles derivations are present that reduce by construction to fitted parameters or self-citations; the central claims are direct outcome measurements on held-out pairs, rendering the work self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on standard supervised learning assumptions plus the domain-specific choice to enforce disjoint training and test attacks; no new entities are postulated and hyperparameters such as loss weights are implicit but not enumerated as free parameters here.

axioms (2)
  • domain assumption SSL-AASIST embeddings can be enhanced for attack classification via AAM loss and RegMixup while preserving generalization to unseen attacks.
    Invoked in the adaptation step described in the abstract.
  • domain assumption Training attacks remain completely disjoint from fingerprint-trial pairs in the open-set evaluation.
    Stated explicitly as a requirement for the zero-shot setup.

pith-pipeline@v0.9.0 · 5720 in / 1460 out tokens · 32213 ms · 2026-05-18T12:41:06.920821+00:00 · methodology

Reference graph

Works this paper leans on

23 extracted references · 23 canonical work pages · 1 internal anchor

  1. [1]

    Advancing Zero-Shot Open-Set Speech Deepfake Source Tracing

    INTRODUCTION “Trust, once lost, is not easily regained.” Advances in neural speech synthesis and voice conversion now enable the creation of highly realistic spoofed speech [1]. Such speech is often indistinguishable from bonafide human speech, both for listeners and for automatic systems [2]. The research community has responded with increasingly pow...

  2. [2]

    Attack Source Verification: Source tracing aims to identify or verify the source of a spoofing attack given an unknown utterance x

    OPEN-SET ATTACK SOURCE VERIFICATION 2.1. Attack Source Verification. Source tracing aims to identify or verify the source of a spoofing attack given an unknown utterance $x$. In an identification setting, a system $F_I$ predicts the source as $\hat{k} = \arg\max_{k \in \mathcal{A}_{\mathrm{ID}}} F_I(x)_k$ (1), where $\mathcal{A}_{\mathrm{ID}}$ denotes the set of in-distribution (seen or known) attacks. In the closed-set case, th...

  3. [3]

    Database We conduct experiments primarily on the recent, publicly available STOPA [14] dataset

    EXPERIMENTAL SETUP, RESULTS AND DISCUSSION 3.1. Database. We conduct experiments primarily on the recent, publicly available STOPA [14] dataset. It contains 699k spoofed utterances from 13 attack systems, formed by combining 8 acoustic models (AMs) and 6 vocoder models (VMs). Each utterance is labeled with its attack id as well as AM and VM ids, enabling multi-l...

  4. [4]

    Specifically, we enhanced SSL-AASIST embeddings with AAM loss and incorporated out-of-domain data to improve variability and robustness

    CONCLUSION We addressed a realistic and challenging open-set, zero-shot source tracing scenario. Specifically, we enhanced SSL-AASIST embeddings with AAM loss and incorporated out-of-domain data to improve variability and robustness. In zero-shot tracing, cosine similarity generalized best to unseen attacks, while few-shot backends (MLP, Siamese) prov...

  5. [5]

    Audio deepfake detection: A survey,

    Jiangyan Yi, Chenglong Wang, Jianhua Tao, Xiaohui Zhang, Chu Yuan Zhang, and Yan Zhao, “Audio deepfake detection: A survey,” arXiv preprint arXiv:2308.14970, 2023

  6. [6]

    A survey on speech deepfake detection,

    Menglu Li, Yasaman Ahmadiadli, and Xiao-Ping Zhang, “A survey on speech deepfake detection,” ACM Computing Surveys, vol. 57, no. 7, pp. 1–38, 2025

  7. [7]

    Source tracing of audio deepfake systems,

    Nicholas Klein, Tianxiang Chen, Hemlata Tak, Ricardo Casal, and Elie Khoury, “Source tracing of audio deepfake systems,” arXiv preprint arXiv:2407.08016, 2024

  8. [8]

    Source tracing: detecting voice spoofing,

    Tinglong Zhu, Xingming Wang, Xiaoyi Qin, and Ming Li, “Source tracing: detecting voice spoofing,” in 2022 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC). IEEE, 2022, pp. 216–220

  9. [9]

    Audio deepfake source tracing using multi-attribute open-set identification and verification,

    Pierre Falez, Tony Marteau, Damien Lolive, and Arnaud Delhay, “Audio deepfake source tracing using multi-attribute open-set identification and verification,” in Proc. Interspeech 2025, 2025, pp. 1528–1532

  10. [10]

    Unveiling Audio Deepfake Origins: A Deep Metric learning And Conformer Network Approach With Ensemble Fusion,

    Ajinkya Kulkarni, Sandipana Dowerah, Tanel Alumäe, and Mathew Magimai Doss, “Unveiling Audio Deepfake Origins: A Deep Metric learning And Conformer Network Approach With Ensemble Fusion,” in Interspeech 2025, 2025, pp. 1533–1537

  11. [11]

    TADA: Training-free Attribution and Out-of-Domain Detection of Audio Deepfakes,

    Adriana Stan, David Combei, Dan Oneata, and Horia Cucu, “TADA: Training-free Attribution and Out-of-Domain Detection of Audio Deepfakes,” in Interspeech 2025, 2025, pp. 1543–1547

  12. [12]

    Open-Set Source Tracing of Audio Deepfake Systems,

    Nicholas Klein, Hemlata Tak, and Elie Khoury, “Open-Set Source Tracing of Audio Deepfake Systems,” in Interspeech 2025, 2025, pp. 1578–1582

  13. [13]

    Listen, Analyze, and Adapt to Learn New Attacks: An Exemplar-Free Class Incremental Learning Method for Audio Deepfake Source Tracing,

    Yang Xiao and Rohan Kumar Das, “Listen, Analyze, and Adapt to Learn New Attacks: An Exemplar-Free Class Incremental Learning Method for Audio Deepfake Source Tracing,” in Interspeech 2025, 2025, pp. 1563–1567

  14. [14]

    VIB-based Real Pre-emphasis Audio Deepfake Source Tracing,

    Thien-Phuc Doan, Kihun Hong, and Souhwan Jung, “VIB-based Real Pre-emphasis Audio Deepfake Source Tracing,” in Interspeech 2025, 2025, pp. 1568–1572

  15. [15]

    Synthetic Speech Source Tracing using Metric Learning,

    Dimitrios Koutsianos, Stavros Zacharopoulos, Yannis Panagakis, and Themos Stafylakis, “Synthetic Speech Source Tracing using Metric Learning,” in Interspeech 2025, 2025, pp. 1558–1562

  16. [16]

    Source Verification for Speech Deepfakes,

    Viola Negroni, Davide Salvi, Paolo Bestagini, and Stefano Tubaro, “Source Verification for Speech Deepfakes,” in Interspeech 2025, 2025, pp. 1548–1552

  17. [17]

    Deep residual learning for image recognition,

    Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778

  18. [18]

    STOPA: A Dataset of Systematic VariaTion Of DeePfake Audio for Open-Set Source Tracing and Attribution,

    Anton Firc, Manasi Chhibber, Jagabandhu Mishra, Vishwanath Pratap Singh, Tomi Kinnunen, and Kamil Malinka, “STOPA: A Dataset of Systematic VariaTion Of DeePfake Audio for Open-Set Source Tracing and Attribution,” in Interspeech 2025, 2025, pp. 1553–1557

  19. [19]

    Investigating self-supervised front ends for speech spoofing countermeasures,

    Xin Wang and Junichi Yamagishi, “Investigating self-supervised front ends for speech spoofing countermeasures,” arXiv preprint arXiv:2111.07725, 2021

  20. [20]

    Arcface: Additive angular margin loss for deep face recognition,

    Jiankang Deng, Jia Guo, Niannan Xue, and Stefanos Zafeiriou, “Arcface: Additive angular margin loss for deep face recognition,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 4690–4699

  21. [21]

    Momentum contrast for unsupervised visual representation learning,

    Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick, “Momentum contrast for unsupervised visual representation learning,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 9729–9738

  22. [22]

    Asvspoof 2019: A large-scale public database of synthesized, converted and replayed speech,

    Xin Wang, Junichi Yamagishi, Massimiliano Todisco, Héctor Delgado, Andreas Nautsch, Nicholas Evans, Md Sahidullah, Ville Vestman, Tomi Kinnunen, Kong Aik Lee, et al., “Asvspoof 2019: A large-scale public database of synthesized, converted and replayed speech,” Computer Speech & Language, vol. 64, pp. 101114, 2020

  23. [23]

    Automatic speaker verification spoofing and deepfake detection using wav2vec 2.0 and data augmentation

    Hemlata Tak, Massimiliano Todisco, Xin Wang, Jee-weon Jung, Junichi Yamagishi, and Nicholas Evans, “Automatic speaker verification spoofing and deepfake detection using wav2vec 2.0 and data augmentation,” arXiv preprint arXiv:2202.12233, 2022