Cosine similarity in SupCon with a delayed negative queue on wav2vec2 XLS-R yields the lowest equal error rates for deepfake audio detection on in-the-wild and pooled evaluations.
Similarity Choice and Negative Scaling in Supervised Contrastive Learning for Deepfake Audio Detection
1 Pith paper cite this work. Polarity classification is still indexing.
abstract
Supervised contrastive learning (SupCon) is widely used to shape representations, but has seen limited targeted study for audio deepfake detection. Existing work typically combines contrastive terms with broader pipelines; however, the focus on SupCon itself is missing. In this work, we run a controlled study on wav2vec2 XLS-R (300M) that varies (i) similarity in SupCon (cosine vs angular similarity derived from the hyperspherical angle) and (ii) negative scaling using a warm-started global cross-batch queue. Stage 1 fine-tunes the encoder and projection head with SupCon; Stage 2 freezes them and trains a linear classifier with BCE. Trained on ASVspoof 2019 LA and evaluated on ASV19 eval plus ITW and ASVspoof 2021 DF/LA, Cosine SupCon with a delayed queue achieves the best ITW EER (8.29%) and pooled EER (4.44), while angular similarity performs strongly without queued negatives (ITW 8.70), indicating reduced reliance on large negative sets.
fields
eess.AS 1years
2026 1verdicts
UNVERDICTED 1representative citing papers
citing papers explorer
-
Similarity Choice and Negative Scaling in Supervised Contrastive Learning for Deepfake Audio Detection
Cosine similarity in SupCon with a delayed negative queue on wav2vec2 XLS-R yields the lowest equal error rates for deepfake audio detection on in-the-wild and pooled evaluations.