Unveiling Audio Deepfake Origins: A Deep Metric learning And Conformer Network Approach With Ensemble Fusion

Ajinkya Kulkarni; Mathew Magimai.-Doss; Sandipana Dowerah; Tanel Alumae

arxiv: 2506.02085 · v1 · pith:ZAVT3CG7new · submitted 2025-06-02 · cs.SD · cs.AI · cs.CL · eess.AS

keywords source tracing audio real conformer ensemble fake fusion

Audio deepfakes are reaching an unprecedented level of realism with advanced AI. While current research focuses on discerning real speech from spoofed speech, tracing the source system is equally crucial. This work proposes a novel audio source tracing system that combines a deep metric multi-class N-pair loss with a Real Emphasis and Fake Dispersion framework, a Conformer classification network, and ensemble score-embedding fusion. The N-pair loss improves discriminative ability, while Real Emphasis and Fake Dispersion enhance robustness by focusing on differentiating real and fake speech patterns. The Conformer network captures both global and local dependencies in the audio signal, which is crucial for source tracing. The proposed ensemble score-embedding fusion shows an optimal trade-off between in-domain and out-of-domain source tracing scenarios. We evaluate our method using the Fréchet distance and standard metrics, demonstrating superior performance in source tracing over the baseline system.
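To illustrate the multi-class N-pair loss named in the abstract, here is a minimal sketch of its generic form: for N anchor/positive embedding pairs drawn from N different classes, each anchor is pushed toward its own positive and away from the positives of the other classes. This is an assumption-laden illustration only. The abstract does not give the paper's exact formulation, and the Real Emphasis and Fake Dispersion terms are not modelled here. The function name `n_pair_loss` and the plain-list embedding format are hypothetical.

```python
import math


def n_pair_loss(anchors, positives):
    """Generic multi-class N-pair loss (illustrative sketch, not the paper's code).

    anchors, positives: lists of N equal-length embedding vectors, where
    anchors[i] and positives[i] share a class and different rows belong to
    different classes (an assumption of this sketch).
    """
    n = len(anchors)
    total = 0.0
    for i in range(n):
        # Dot-product similarity between anchor i and every positive.
        sims = [
            sum(a * p for a, p in zip(anchors[i], positives[j]))
            for j in range(n)
        ]
        # Numerically stable log-sum-exp over all similarities.
        m = max(sims)
        log_sum = m + math.log(sum(math.exp(s - m) for s in sims))
        # Equals log(1 + sum_{j != i} exp(s_ij - s_ii)).
        total += log_sum - sims[i]
    return total / n


if __name__ == "__main__":
    # Toy embeddings for three hypothetical source systems.
    separated = [[10.0, 0.0, 0.0], [0.0, 10.0, 0.0], [0.0, 0.0, 10.0]]
    collapsed = [[0.0, 0.0, 0.0]] * 3
    print(n_pair_loss(separated, separated))
    print(n_pair_loss(collapsed, collapsed))
```

When the embeddings of different sources are well separated the loss approaches zero, and when all embeddings collapse to the same point it equals log N, which is why minimizing it encourages source-discriminative embeddings.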

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Dual-Branch Gated Fusion for Open-Set Audio Deepfake Source Tracing
cs.SD 2026-06 unverdicted novelty 6.0

A gated fusion of XLSR-53 and CORES features with energy margin and diversity losses reaches 97.6% ID accuracy and reduces FPR95 by 83.5% relative to the Interspeech 2025 baseline on MLAAD.