DeepFense: A Unified, Modular, and Extensible Framework for Robust Deepfake Audio Detection
Pith reviewed 2026-05-10 16:54 UTC · model grok-4.3
The pith
Pre-trained front-end feature extractors dominate variance in deepfake audio detection performance and embed biases by gender, language, and audio quality.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
DeepFense integrates recent detection architectures, loss functions, and augmentation pipelines into a single extensible codebase, together with more than 100 ready-to-run recipes. Experiments across more than 400 trained models establish that the pre-trained front-end feature extractor explains the largest share of performance variance, that curated training data measurably improves cross-domain generalization, and that high-accuracy models exhibit severe biases with respect to audio quality, speaker gender, and language.
What carries the argument
The pre-trained front-end feature extractor: it converts raw audio waveforms into the representations used by the downstream classifier, and it drives most of the observed differences in detection accuracy.
Load-bearing premise
The more than 400 evaluated models and chosen datasets sufficiently represent the diversity of real-world deepfake audio attacks and deployment conditions.
What would settle it
A new deepfake test set containing previously unseen languages, gender distributions, or audio quality levels on which the study's top models show neither large performance drops nor measurable subgroup biases.
Original abstract
Speech deepfake detection is a well-established research field with different models, datasets, and training strategies. However, the lack of standardized implementations and evaluation protocols limits reproducibility, benchmarking, and comparison across studies. In this work, we present DeepFense, a comprehensive, open-source PyTorch toolkit integrating the latest architectures, loss functions, and augmentation pipelines, alongside over 100 recipes. Using DeepFense, we conducted a large-scale evaluation of more than 400 models. Our findings reveal that while carefully curated training data improves cross-domain generalization, the choice of pre-trained front-end feature extractor dominates overall performance variance. Crucially, we show severe biases in high-performing models regarding audio quality, speaker gender, and language. DeepFense is expected to facilitate real-world deployment with the necessary tools to address equitable training data selection and front-end fine-tuning.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces DeepFense, an open-source PyTorch toolkit for speech deepfake detection that unifies architectures, loss functions, augmentation pipelines, and over 100 recipes. Through evaluation of more than 400 models, the authors claim that pre-trained front-end feature extractors dominate performance variance, that carefully curated training data improves cross-domain generalization, and that high-performing models exhibit severe biases with respect to audio quality, speaker gender, and language. The toolkit is positioned to support reproducible research and equitable real-world deployment.
Significance. If the empirical patterns hold, the work provides a valuable standardized platform that directly addresses reproducibility challenges in deepfake audio detection. The large-scale sweep (>400 models) with an open-source modular implementation and explicit recipes constitutes a strong contribution by enabling community verification and extension. The attribution of variance to front-end choice and the documentation of quality/gender/language biases offer actionable insights for improving robustness and fairness, provided the evaluation details are clarified.
major comments (2)
- [Evaluation section] The claim that front-end choice dominates overall performance variance is not accompanied by details on data splits, statistical significance testing of the observed differences, or a variance decomposition method. All three are required to substantiate the conclusion that the front-end outweighs other factors such as training data curation.
- [Bias analysis] The reported severe biases of high-performing models along the audio quality, speaker gender, and language axes lack explicit controls for confounding variables (e.g., dataset size imbalances or interactions with model architecture). Without such controls, the attribution of these biases as described in the abstract and results is undermined.
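To make the variance-decomposition request concrete, the sketch below computes eta-squared (each factor's between-level sum of squares divided by the total sum of squares) for a balanced, fully crossed grid of runs. It is an illustration of one possible analysis, not the authors' method: the function name `eta_squared`, the field names `front_end`, `train_set`, and `eer`, and the toy numbers are all hypothetical and are not taken from the paper.

```python
from collections import defaultdict
from statistics import mean


def eta_squared(runs, factors, metric="eer"):
    """Share of the total sum of squares explained by each factor's main effect.

    `runs` is a list of dicts, one per trained model. The shares add up
    cleanly only for a balanced, fully crossed design (an assumption here).
    """
    values = [r[metric] for r in runs]
    grand = mean(values)
    ss_total = sum((v - grand) ** 2 for v in values)

    shares = {}
    for factor in factors:
        by_level = defaultdict(list)
        for r in runs:
            by_level[r[factor]].append(r[metric])
        # Between-level sum of squares for this factor.
        ss_factor = sum(len(v) * (mean(v) - grand) ** 2 for v in by_level.values())
        shares[factor] = ss_factor / ss_total

    # Whatever the main effects do not explain: interactions plus noise.
    shares["interaction+residual"] = 1.0 - sum(shares.values())
    return shares


# Hypothetical toy results (EER in %), NOT values from the paper.
runs = [
    {"front_end": "A", "train_set": "X", "eer": 10.0},
    {"front_end": "A", "train_set": "Y", "eer": 12.0},
    {"front_end": "B", "train_set": "X", "eer": 20.0},
    {"front_end": "B", "train_set": "Y", "eer": 22.0},
]
shares = eta_squared(runs, ["front_end", "train_set"])
print(shares)
```

In this toy grid the front-end accounts for about 96% of the sum of squares and the training set for about 4%. With unbalanced designs the main-effect shares no longer partition the total, and a mixed-effects model would be the more appropriate tool.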
minor comments (2)
- [Abstract] Include a brief statement of the number of datasets and the primary metrics (e.g., EER or AUC) used in the >400-model evaluation, so that readers have immediate context for the scale of the claims.
- [Figures and tables] Ensure that the captions and plots for all experimental results clearly label axes, report confidence intervals or standard deviations, and highlight the key takeaway regarding front-end dominance.
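As a rough illustration of the metric and the confidence intervals requested above, the following sketch computes an equal error rate (EER) and a percentile bootstrap interval using only the standard library. The score convention (higher means more bona fide), the label coding (1 = bona fide, 0 = spoof), the function names, and the toy scores are assumptions made for this example; they are not the DeepFense API or data from the paper.

```python
import random


def eer(scores, labels):
    """Equal error rate. Assumes labels 1 = bona fide, 0 = spoof,
    and that a higher score means "more likely bona fide"."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    best_gap, best_eer = float("inf"), 1.0
    # Sweep every observed score (plus +inf) as the accept threshold.
    for t in sorted(set(scores)) + [float("inf")]:
        frr = sum(s < t for s in pos) / len(pos)   # bona fide rejected
        far = sum(s >= t for s in neg) / len(neg)  # spoof accepted
        if abs(far - frr) < best_gap:
            best_gap, best_eer = abs(far - frr), (far + frr) / 2
    return best_eer


def bootstrap_ci(scores, labels, n_boot=200, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the EER (resampling trials)."""
    rng = random.Random(seed)
    idx = range(len(scores))
    stats = []
    while len(stats) < n_boot:
        sample = [rng.choice(idx) for _ in idx]
        ys = [labels[i] for i in sample]
        if 0 < sum(ys) < len(ys):  # skip resamples missing a class
            stats.append(eer([scores[i] for i in sample], ys))
    stats.sort()
    lo = stats[int(alpha / 2 * n_boot)]
    hi = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi


# Hypothetical toy scores, NOT data from the paper.
scores = [0.9, 0.8, 0.7, 0.4, 0.6, 0.3, 0.2, 0.1]
labels = [1, 1, 1, 1, 0, 0, 0, 0]
point = eer(scores, labels)
lo, hi = bootstrap_ci(scores, labels)
print(f"EER = {point:.3f}, 95% CI = [{lo:.3f}, {hi:.3f}]")
```

The threshold sweep is quadratic in the number of trials, which is fine for a sketch but too slow for full evaluation sets; a sorted cumulative-count implementation would be the usual choice there.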
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and the recommendation for minor revision. We address each major comment below with point-by-point responses, indicating revisions that will be incorporated to strengthen the statistical rigor and controls in the manuscript.
- Referee: [Evaluation section] The claim that front-end choice dominates overall performance variance is not accompanied by details on data splits, statistical significance testing of the observed differences, or variance decomposition methods, which are required to substantiate the dominance conclusion over other factors such as training data curation.
  Authors: We appreciate this observation. While the large-scale evaluation of over 400 models was performed with fixed data protocols and showed consistent patterns favoring front-end extractors, the submitted manuscript did not include explicit variance decomposition or formal significance testing. In the revised version, we will expand the Evaluation section to specify the train/validation/test splits in detail, report results from statistical significance tests (e.g., ANOVA or mixed-effects models comparing front-end contributions against training data and architecture factors), and add a variance decomposition analysis to quantify relative contributions. These additions will directly substantiate the dominance claim.
  Revision: yes
- Referee: [Bias analysis] The reported severe biases in high-performing models on audio quality, speaker gender, and language axes lack explicit controls for confounding variables (e.g., dataset size imbalances or interactions with model architecture), undermining the attribution of these biases as described in the abstract and results.
  Authors: We agree that explicit controls for confounders strengthen the bias analysis. Our original evaluation selected high-performing models and observed biases across multiple architectures, but did not fully adjust for dataset imbalances or interactions. In the revision, we will add controls in the Bias analysis section, including regression adjustments for dataset size and architecture interactions, stratified reporting where possible, and updated figures/tables showing bias metrics with these controls applied. The abstract and results will be revised to reflect the controlled findings.
  Revision: yes
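The stratified reporting mentioned in this exchange can be sketched with direct standardization: compute each group's error rate within every stratum of a suspected confounder, then average those rates with the same stratum weights for all groups. The example below is a minimal sketch under stated assumptions, not the authors' planned analysis. The function name, the choice of audio quality as the confounder, the use of a fixed-threshold error rate instead of EER, and all counts are hypothetical.

```python
from collections import defaultdict


def pooled_and_standardized(records):
    """records: iterable of (group, stratum, n_errors, n_trials).

    Returns {group: (pooled_rate, standardized_rate)}. The standardized rate
    weights every stratum by its share of all trials, identically for each
    group. Assumes every group has trials in every stratum.
    """
    errors, trials = defaultdict(int), defaultdict(int)
    stratum_trials = defaultdict(int)
    cell_rate = {}
    for group, stratum, n_err, n in records:
        errors[group] += n_err
        trials[group] += n
        stratum_trials[stratum] += n
        cell_rate[(group, stratum)] = n_err / n

    total = sum(stratum_trials.values())
    result = {}
    for group in errors:
        standardized = sum(
            cell_rate[(group, s)] * n / total for s, n in stratum_trials.items()
        )
        result[group] = (errors[group] / trials[group], standardized)
    return result


# Hypothetical counts, NOT results from the paper. Both groups have the same
# error rate within each quality stratum, but different stratum mixes.
records = [
    ("female", "clean", 10, 100),
    ("female", "noisy", 120, 300),
    ("male", "clean", 30, 300),
    ("male", "noisy", 40, 100),
]
res = pooled_and_standardized(records)
for group, (pooled, std) in res.items():
    print(f"{group}: pooled={pooled:.3f} standardized={std:.3f}")
```

In this constructed case the pooled rates differ (0.325 versus 0.175) while the standardized rates coincide at 0.25, showing how an apparent gender gap can be produced entirely by unequal exposure to noisy audio. A gap that survives standardization is harder to attribute to the confounder.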
Circularity Check
No significant circularity: empirical benchmarking on external data
The paper presents an open-source toolkit and reports direct performance measurements from evaluating more than 400 models across curated external datasets. Claims that front-end choice dominates variance and that high-performing models exhibit biases on quality/gender/language axes are observational results from this sweep, not outputs of any internal equations, fitted parameters renamed as predictions, or self-citation chains. No derivation steps exist that could reduce reported findings to inputs defined inside the paper; the work is self-contained against external benchmarks and explicitly falsifiable via the released recipes and code.
Reference graph
Works this paper leans on
- [1] DeepFense: A Unified, Modular, and Extensible Framework for Robust Deepfake Audio Detection (arXiv, 2026). Introduction. While speech synthesis technologies are essential in speech-based human-human and human-machine interaction, they are also being misused to forge speech deepfakes that threaten voice biometric systems [1] as well as human listeners [2]. The research community has increasingly devoted itself to the robust detection of speech deepfakes (and ...
- [2] DeepFense Framework. 2.1. Design principle. As Figure 1 illustrates, DeepFense implements a modular architecture that separates experimental specification, data processing, model training, execution, and evaluation. The complete training-evaluation setup is specified in a single human-readable file. These loosely coupled and highly modular components...
- [3] Replicating SOTA Results. To validate the correctness and reliability of the DeepFense framework, we assess its ability to replicate SOTA results from the literature. Specifically, we re-implement and evaluate several published systems under controlled conditions using DeepFense’s unified pipeline, ensuring identical preprocessing, optimizer configur...
- [4] Large-scale Comparison. Prior studies typically evaluate a single model or a small set of architectures on one or two datasets, making it difficult to disentangle the contributions of the front-end, back-end, and training data from implementation-specific choices [4], [5]. A key motivation behind DeepFense is to enable system- Table 3: EER (%) for EAT trained...
- [5] Fairness Study. Standard evaluation metrics like EER show overall performance, but they may hide biases that affect certain user groups or conditions. For deepfake detection systems to be used in real-world applications, ensuring equitable protection is paramount. In this section, we investigate the fairness of our trained models across three vital di...
- [6] Discussion. Here we summarize the findings from the experimental evaluation with around 100 distinct systems built using the DeepFense toolkit. First, the choice of front-end turned out to be critical. Across the 13 evaluation datasets, Wav2Vec2 performed the best, achieving a macro-average EER of 25.5%. However, the front-end superiority is domain-depende...
- [7] Conclusion. In this work, we presented DeepFense, a PyTorch-based toolkit for speech deepfake detection with a configuration-driven design that lowers the barrier to reproducible experimentation. Across a large-scale evaluation spanning 13 datasets and three fairness axes, two factors consistently dominate both performance and group-level equity: front...
- [8] Z. Wu, J. Yamagishi, T. Kinnunen, C. Hanilçi, M. Sahidullah, A. Sizov, N. Evans, M. Todisco, and H. Delgado, “ASVspoof: The automatic speaker verification spoofing and countermeasures challenge,” IEEE Journal of Selected Topics in Signal Processing, vol. 11, no. 4, pp. 588–604, 2017.
- [9] K. Warren, T. Tucker, A. Crowder, D. Olszewski, A. Lu, C. Fedele, M. Pasternak, S. Layton, K. Butler, C. Gates, et al., “‘Better be computer or I’m dumb’: A large-scale evaluation of humans as audio deepfake detectors,” in Proc. ACM CCS, 2024, pp. 2696–2710.
- [10] H. Tak, J. Patino, M. Todisco, A. Nautsch, N. Evans, and A. Larcher, “End-to-end anti-spoofing with RawNet2,” in Proc. ICASSP, 2020, pp. 6369–6373.
- [11] J.-w. Jung, H.-S. Heo, H. Tak, H.-j. Shim, J. S. Chung, B.-J. Lee, H.-J. Yu, and N. Evans, “AASIST: Audio anti-spoofing using integrated spectro-temporal graph attention networks,” in Proc. ICASSP, IEEE, 2022, pp. 6367–6371.
- [12] H. Tak, M. Todisco, X. Wang, J.-w. Jung, J. Yamagishi, and N. Evans, “Automatic speaker verification spoofing and deepfake detection using wav2vec 2.0 and data augmentation,” in Proc. Odyssey, 2022, pp. 112–119.
- [13] Y. El Kheir, T. Polzehl, and S. Möller, “BiCrossMamba-ST: Speech Deepfake Detection with Bidirectional Mamba Spectro-Temporal Cross-Attention,” in Proc. Interspeech, 2025, pp. 2235–2239.
- [14] X. Wang et al., “ASVspoof 2019: A large-scale public database of synthesized, converted and replayed speech,” Computer Speech & Language, vol. 64, p. 101114, 2020.
- [15] J. Yi et al., “ADD 2023: The Second Audio Deepfake Detection Challenge,” in Proc. IJCAI Workshop on Deepfake Audio Detection and Analysis, 2023.
- [16] X. Liu et al., “ASVspoof 2021: Towards Spoofed and Deepfake Speech Detection in the Wild,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 31, pp. 2507–2522, 2023.
- [17] H. Tak, M. R. Kamble, J. Patino, M. Todisco, and N. W. D. Evans, “RawBoost: A raw data boosting and augmentation method applied to automatic speaker verification anti-spoofing,” in Proc. ICASSP, 2022, pp. 6382–6386.
- [18] Y. Zhang, F. Jiang, and Z. Duan, “One-Class Learning Towards Synthetic Voice Spoofing Detection,” IEEE Signal Processing Letters, vol. 28, pp. 937–941, 2021.
- [19] A. Baevski, Y. Zhou, A. Mohamed, and M. Auli, “Wav2vec 2.0: A framework for self-supervised learning of speech representations,” in Proc. NeurIPS, vol. 33, 2020, pp. 12449–12460.
- [20] M. Ott, S. Edunov, A. Baevski, A. Fan, S. Gross, N. Ng, D. Grangier, and M. Auli, “Fairseq: A Fast, Extensible Toolkit for Sequence Modeling,” in Proc. NAACL, W. Ammar, A. Louis, and N. Mostafazadeh, Eds., 2019, pp. 48–53.
- [21] T. Wolf et al., “Transformers: State-of-the-Art Natural Language Processing,” in Proc. EMNLP, Online: Association for Computational Linguistics, 2020, pp. 38–45.
- [22] S. Dowerah, A. Kulkarni, A. Kulkarni, H. M. Tran, J. Kalda, A. Fedorchenko, B. Fauve, D. Lolive, T. Alumäe, and M. Magimai.-Doss, “Speech df arena: A leaderboard for speech deepfake detection models,” IEEE Open Journal of Signal Processing, vol. 7, pp. 73–81, 2026.
- [23] L. Zhang et al., Wedefense: A toolkit to defend against fake audio, 2026.
- [24] M. Ravanelli et al., SpeechBrain: A general-purpose speech toolkit, arXiv:2106.04624, 2021.
- [25] W. Ge, X. Wang, X. Liu, and J. Yamagishi, “Post-training for deepfake speech detection,” in Proc. ASRU, 2025.
- [26] S. Chen et al., “WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing,” IEEE Journal of Selected Topics in Signal Processing, vol. 16, no. 6, pp. 1505–1518, 2022.
- [27] W.-N. Hsu, B. Bolte, Y.-H. H. Tsai, K. Lakhotia, R. Salakhutdinov, and A. Mohamed, “HuBERT: Self-supervised speech representation learning by masked prediction of hidden units,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 29, pp. 3451–3460, 2021.
- [28] W. Chen, Y. Liang, Z. Ma, Z. Zheng, and X. Chen, “EAT: Self-supervised pre-training with efficient audio transformer,” in Proc. IJCAI, 2024, pp. 3807–3815.
- [29] Y. Li et al., “MERT: Acoustic music understanding model with large-scale self-supervised training,” in Proc. ICLR, 2024.
- [30] A. Radford, J. W. Kim, T. Xu, G. Brockman, C. McLeavey, and I. Sutskever, “Robust speech recognition via large-scale weak supervision,” in International Conference on Machine Learning, PMLR, 2023, pp. 28492–28518.
- [31] S. Chen, Y. Wu, C. Wang, S. Liu, D. Tompkins, Z. Chen, W. Che, X. Yu, and F. Wei, “BEATs: Audio pre-training with acoustic tokenizers,” in Proc. ICML, A. Krause, E. Brunskill, K. Cho, B. Engelhardt, S. Sabato, and J. Scarlett, Eds., ser. Proceedings of Machine Learning Research, vol. 202, PMLR, 2023, pp. 5178–5193.
- [32] Y.-A. Chung, Y. Zhang, W. Han, C.-C. Chiu, J. Qin, R. Pang, and Y. Wu, “W2v-BERT: Combining contrastive learning and masked language modeling for self-supervised speech pre-training,” in Proc. ASRU, IEEE, 2021, pp. 244–250.
- [33] B. Desplanques, J. Thienpondt, and K. Demuynck, “ECAPA-TDNN: Emphasized channel attention, propagation and aggregation in TDNN based speaker verification,” in Proc. Interspeech, 2020, pp. 3830–3834.
- [34] T. Liu, D.-T. Truong, R. K. Das, K. A. Lee, and H. Li, “Nes2Net: A Lightweight Nested Architecture for Foundation Model Driven Speech Anti-spoofing,” IEEE Transactions on Information Forensics and Security, 2025.
- [35] D.-T. Truong, R. Tao, T. Nguyen, H.-T. Luong, K. A. Lee, and E. S. Chng, “Temporal-Channel Modeling in Multi-head Self-Attention for Synthetic Speech Detection,” in Proc. Interspeech, ISCA, 2024, pp. 537–541.
- [36] Y. El Kheir, Y. Samih, S. Maharjan, T. Polzehl, and S. Möller, “Comprehensive layer-wise analysis of ssl models for audio deepfake detection,” in Findings of the Association for Computational Linguistics: NAACL 2025, 2025, pp. 4070–4082.
- [37] F. Wang, J. Cheng, W. Liu, and H. Liu, “Additive margin softmax for face verification,” IEEE Signal Processing Letters, vol. 25, no. 7, pp. 926–930, 2018.
- [38] W. Liu, Y. Wen, Z. Yu, M. Li, B. Raj, and L. Song, “SphereFace: Deep Hypersphere Embedding for Face Recognition,” in Proc. CVPR, Honolulu, HI: IEEE, 2017, pp. 6738–6746.
- [39] N. M. Müller, N. Evans, H. Tak, P. Sperl, and K. Böttinger, “Harder or different? Understanding generalization of audio deepfake detection,” in Proc. Interspeech, 2024, pp. 2705–2709.
- [40] N. M. Müller, P. Kawa, W. H. Choong, E. Casanova, E. Gölge, T. Müller, P. Syga, P. Sperl, and K. Böttinger, “MLAAD: The Multi-Language Audio Anti-Spoofing Dataset,” in Proc. IJCNN, Yokohama, Japan: IEEE, 2024, pp. 1–7.
- [41] H. Wu, Y. Tseng, and H.-y. Lee, “CodecFake: Enhancing Anti-Spoofing Models Against Deepfake Audios from Codec-Based Speech Synthesis Systems,” in Proc. Interspeech, 2024, pp. 1770–1774.
- [42] P. A. T. Flórez, R. Manrique, and B. P. Nunes, “HABLA: A dataset of Latin American Spanish accents for voice anti-spoofing,” in Proc. Interspeech, 2023, pp. 1963–1967.
- [43] L. Zhang, X. Wang, E. Cooper, N. Evans, and J. Yamagishi, “The PartialSpoof Database and Countermeasures for the Detection of Short Fake Speech Segments Embedded in an Utterance,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, pp. 1–13, 2022.
- [44] A. Yaroshchuk, C. Papastergiopoulos, L. Cuccovillo, P. Aichroth, K. Votis, and D. Tzovaras, “An open dataset of synthetic speech,” in Proc. WIFS, IEEE, 2023, pp. 1–6.
- [45] N. Müller, P. Kawa, W.-H. Choong, A. Stan, A. T. Bukkapatnam, K. Pizzi, A. Wagner, and P. Sperl, “Replay Attacks Against Audio Deepfake Detection,” in Proc. Interspeech, 2025, pp. 2245–2249.
- [46] H. Yin, Y. Xiao, R. K. Das, J. Bai, and T. Dang, “Esdd 2026: Environmental sound deepfake detection challenge evaluation plan,” arXiv preprint arXiv:2508.04529, 2025.
- [47] X. Wang and J. Yamagishi, “Investigating Self-Supervised Front Ends for Speech Spoofing Countermeasures,” in Proc. Odyssey, 2022, pp. 100–106.
- [48] J. D. Williams, “Exploiting the ASR n-best by tracking multiple dialog state hypotheses,” in Proc. Interspeech, 2008, pp. 191–194.
- [49] X. Zhang, Y. Wang, L. Li, L. Jin, and M. Li, “Compspoof: A dataset and joint learning framework for component-level audio anti-spoofing countermeasures,” in Proc. ICASSP, 2026.
- [50] L. Comanducci, P. Bestagini, and S. Tubaro, “Fakemusiccaps: A dataset for detection and attribution of synthetic music generated via text-to-music models,” Journal of Imaging, vol. 11, no. 7, p. 242, 2025.
- [51] Y. Zang et al., “CtrSVDD: A Benchmark Dataset and Baseline Analysis for Controlled Singing Voice Deepfake Detection,” in Proc. Interspeech, 2024, pp. 4783–4787.
- [52] O. Chouchane, C. Busch, C. Galdi, N. Evans, and M. Todisco, “A Comparison of Differential Performance Metrics for the Evaluation of Automatic Speaker Verification Fairness,” in Proc. Odyssey, 2024, pp. 209–216.
- [53] A. Kumar, K. Tan, Z. Ni, P. Manocha, X. Zhang, E. Henderson, and B. Xu, “Torchaudio-Squim: Reference-Less Speech Quality and Intelligibility Measures in Torchaudio,” in Proc. ICASSP, Rhodes Island, Greece: IEEE, 2023, pp. 1–5.
- [54] G. Mittag, B. Naderi, A. Chehadi, and S. Möller, “NISQA: A Deep CNN-Self-Attention Model for Multidimensional Speech Quality Prediction with Crowdsourced Datasets,” in Proc. Interspeech, 2021, pp. 2127–2131.
- [55] A. Défossez, J. Copet, G. Synnaeve, and Y. Adi, “High fidelity neural audio compression,” Transactions on Machine Learning Research, 2023.
- [56] N. Zeghidour, A. Luebs, A. Omran, J. Skoglund, and M. Tagliasacchi, “SoundStream: An End-to-End Neural Audio Codec,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 30, pp. 495–507, 2022.