
arxiv: 2604.08450 · v1 · submitted 2026-04-09 · 💻 cs.SD · eess.AS

DeepFense: A Unified, Modular, and Extensible Framework for Robust Deepfake Audio Detection

Pith reviewed 2026-05-10 16:54 UTC · model grok-4.3

classification 💻 cs.SD eess.AS
keywords deepfake detection · audio deepfakes · speech synthesis detection · model bias analysis · front-end feature extraction · cross-domain generalization · robustness evaluation · PyTorch toolkit

The pith

Pre-trained front-end feature extractors dominate variance in deepfake audio detection performance and embed biases by gender, language, and audio quality.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents DeepFense, an open-source PyTorch toolkit that bundles architectures, loss functions, augmentations, and over 100 training recipes to standardize deepfake audio detection work. Large-scale runs of more than 400 models show that careful data curation improves generalization across domains, yet the choice of pre-trained front-end accounts for most differences in accuracy. Top models also display clear biases, with performance varying sharply according to speaker gender, spoken language, and audio quality. The framework supplies practical tools for selecting more equitable training data and fine-tuning front-ends to support real-world deployment.

Core claim

DeepFense integrates recent detection architectures, loss functions, and augmentation pipelines into a single extensible codebase along with more than 100 ready recipes. Experiments across more than 400 trained models establish that the pre-trained front-end feature extractor explains the largest share of performance variance, that curated training data measurably improves cross-domain generalization, and that high-accuracy models exhibit severe biases with respect to audio quality, speaker gender, and language.

What carries the argument

Pre-trained front-end feature extractor that converts raw audio waveforms into representations used by the downstream classifier and drives most observed differences in detection accuracy.

Load-bearing premise

The more than 400 evaluated models and chosen datasets sufficiently represent the diversity of real-world deepfake audio attacks and deployment conditions.

What would settle it

A new deepfake test set containing previously unseen languages, gender distributions, or audio quality levels on which the study's top models show neither large performance drops nor measurable subgroup biases.
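
A check of this kind can be sketched as a per-subgroup EER comparison. The snippet below is a minimal illustration, assuming each trial carries a subgroup label, a bona fide/spoof flag, and a detector score where higher means more bona fide. The function names (`compute_eer`, `subgroup_eers`) and the toy scores are hypothetical and are not taken from DeepFense or from the paper's results; the EER is a coarse threshold sweep, not an interpolated estimate.

```python
def compute_eer(bonafide_scores, spoof_scores):
    """Approximate EER; assumes higher score = more bona fide, non-empty lists."""
    thresholds = sorted(set(bonafide_scores) | set(spoof_scores))
    best_gap, eer = float("inf"), 1.0
    for t in thresholds:
        # False rejection: bona fide trials scored below the threshold.
        frr = sum(s < t for s in bonafide_scores) / len(bonafide_scores)
        # False acceptance: spoof trials scored at or above the threshold.
        far = sum(s >= t for s in spoof_scores) / len(spoof_scores)
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return eer


def subgroup_eers(trials):
    """trials: iterable of (group, is_bonafide, score). Returns {group: EER}."""
    by_group = {}
    for group, is_bonafide, score in trials:
        bona, spoof = by_group.setdefault(group, ([], []))
        (bona if is_bonafide else spoof).append(score)
    return {g: compute_eer(b, s) for g, (b, s) in by_group.items()}


# Toy, hand-made scores (hypothetical; not results from the paper).
trials = [
    ("female", True, 0.9), ("female", True, 0.8),
    ("female", False, 0.1), ("female", False, 0.2),
    ("male", True, 0.9), ("male", True, 0.3),
    ("male", False, 0.4), ("male", False, 0.1),
]
eers = subgroup_eers(trials)
gap = max(eers.values()) - min(eers.values())
print(eers, gap)
```

The `gap` between the best and worst subgroup is the quantity to watch on a new test set: the bias claim predicts it stays large for the top models, and a small gap would count against that claim.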

Figures

Figures reproduced from arXiv: 2604.08450 by Arnab Das, Enes Erdem Erdogan, Feidi Kallel, Ngoc Thang Vu, Sebastian Moeller, Tim Polzehl, Xin Wang, Yassine El Kheir, Yixuan Xiao.

Figure 1
Figure 1: System architecture of the DeepFense framework, illustrating the configuration-driven data pipeline, modular model engine (front-end, back-end, loss), unified training loop, and evaluation components, with built-in logging and experiment tracking.
… With an Apache 2.0 license, the toolkit is suitable for a wide range of users. It provides more than 100 recipes and 400 pre-trained models for differ…
Figure 2
Figure 2: Boxplots of detection systems based on their EERs on the test set (x-axis). Each box groups 16 systems using the same front-end (from left to right: Wav2Vec2, WavLM, HuBERT, and EAT) but varied training data and back-ends. The three lines correspond to mean EERs of systems using either the ASV19 (blue), ADD23 (red), or CodecFake (green) training sets.
Test sets on the x-axis: ASV19, LA21, DF21, ITW, ODSS, HABLA, ADD22-1, ADD22-2, ADD23-1, …
Figure 3
Figure 3: Mean EER (%) per front-end and evaluation benchmark, averaged over all back-ends and training sets.
…atic, large-scale comparisons that are otherwise infeasible without a unified framework. DeepFense makes this feasible by providing a single controlled pipeline where all configurations share identical preprocessing, optimizer settings, and evaluation protocols, ensuring that observed differences are attrib…
Figure 4
Figure 4: Mean EER analysis across training datasets, front-ends, and back-ends. (a) Front-end perspective. (b) Back-end perspective.
Figure 5
Figure 5: EER on non-speech deepfake detection.
[Plot text: panels (a) PESQ ASV19, (b) PESQ CodecFake, (c) NISQA-MOS ASV19, (d) NISQA-MOS CodecFake; x-axis: quality band; y-axis: EER (lower = better); legend: front-end (Wav2Vec2.0, HuBERT, …)]
Figure 6
Figure 6: EER sensitivity to audio quality across systems trained on ASV19 and CodecFake.
…tion does not sensitise models to quality variation. CodecFake, by contrast, is built on recent neural codec tokenizers (e.g. EnCodec [48], SoundStream [49]) that produce high-fidelity, perceptually realistic speech. Models trained on CodecFake, therefore, learn features tied to high-quality synthesis artefacts, which may gener…
Figure 7
Figure 7: EER (%) per language for all system configurations trained on ASVspoof 2019 (left) and CodecFake (right). Languages are sorted by mean EER (lowest at top). Each dot is one system; colour encodes front-end; marker shape encodes back-end. Grey bars show the min–max EER range across all configurations.
The consistent reversal of bias direction between corpora strongly implicates training data gender imbalance…
Original abstract

Speech deepfake detection is a well-established research field with different models, datasets, and training strategies. However, the lack of standardized implementations and evaluation protocols limits reproducibility, benchmarking, and comparison across studies. In this work, we present DeepFense, a comprehensive, open-source PyTorch toolkit integrating the latest architectures, loss functions, and augmentation pipelines, alongside over 100 recipes. Using DeepFense, we conducted a large-scale evaluation of more than 400 models. Our findings reveal that while carefully curated training data improves cross-domain generalization, the choice of pre-trained front-end feature extractor dominates overall performance variance. Crucially, we show severe biases in high-performing models regarding audio quality, speaker gender, and language. DeepFense is expected to facilitate real-world deployment with the necessary tools to address equitable training data selection and front-end fine-tuning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces DeepFense, an open-source PyTorch toolkit for speech deepfake detection that unifies architectures, loss functions, augmentation pipelines, and over 100 recipes. Through evaluation of more than 400 models, the authors claim that pre-trained front-end feature extractors dominate performance variance, that carefully curated training data improves cross-domain generalization, and that high-performing models exhibit severe biases with respect to audio quality, speaker gender, and language. The toolkit is positioned to support reproducible research and equitable real-world deployment.

Significance. If the empirical patterns hold, the work provides a valuable standardized platform that directly addresses reproducibility challenges in deepfake audio detection. The large-scale sweep (>400 models) with an open-source modular implementation and explicit recipes constitutes a strong contribution by enabling community verification and extension. The attribution of variance to front-end choice and the documentation of quality/gender/language biases offer actionable insights for improving robustness and fairness, provided the evaluation details are clarified.

major comments (2)
  1. [Evaluation section] The claim that front-end choice dominates overall performance variance is not accompanied by details on data splits, statistical significance testing of the observed differences, or variance decomposition methods, which are required to substantiate the dominance conclusion over other factors such as training data curation.
  2. [Bias analysis] The reported severe biases in high-performing models on audio quality, speaker gender, and language axes lack explicit controls for confounding variables (e.g., dataset size imbalances or interactions with model architecture), undermining the attribution of these biases as described in the abstract and results.
minor comments (2)
  1. [Abstract] Include a brief statement on the number of datasets and primary metrics (e.g., EER or AUC) used in the >400-model evaluation to give readers immediate context for the scale of the claims.
  2. [Figures and tables] Ensure all experimental result visualizations clearly label axes, report confidence intervals or standard deviations, and highlight the key takeaway regarding front-end dominance.
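
The confidence intervals requested in the second minor comment could be produced in several ways; one common option is a percentile bootstrap over the systems in each group. The sketch below is an assumption about how that might look, using hypothetical EER values and a hypothetical helper name `bootstrap_mean_ci`. The paper does not state which interval method, if any, it uses.

```python
import random
import statistics


def bootstrap_mean_ci(values, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(values)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        statistics.fmean(rng.choices(values, k=n)) for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return statistics.fmean(values), lo, hi


# Hypothetical EERs (%) for systems sharing one front-end; not from the paper.
eers = [21.0, 24.5, 26.0, 23.2, 28.9, 25.1, 22.7, 27.4]
mean, lo, hi = bootstrap_mean_ci(eers)
print(f"mean EER = {mean:.1f}% (95% CI {lo:.1f} to {hi:.1f})")
```

With only a handful of systems per group the interval will be wide, which is itself useful context for a reader comparing front-ends in a boxplot.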

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and the recommendation for minor revision. We address each major comment below with point-by-point responses, indicating revisions that will be incorporated to strengthen the statistical rigor and controls in the manuscript.

Point-by-point responses
  1. Referee: [Evaluation section] the claim that front-end choice dominates overall performance variance is not accompanied by details on data splits, statistical significance testing of the observed differences, or variance decomposition methods, which are required to substantiate the dominance conclusion over other factors such as training data curation.

    Authors: We appreciate this observation. While the large-scale evaluation of over 400 models was performed with fixed data protocols and showed consistent patterns favoring front-end extractors, the submitted manuscript did not include explicit variance decomposition or formal significance testing. In the revised version, we will expand the Evaluation section to specify the train/validation/test splits in detail, report results from statistical significance tests (e.g., ANOVA or mixed-effects models comparing front-end contributions against training data and architecture factors), and add a variance decomposition analysis to quantify relative contributions. These additions will directly substantiate the dominance claim.

    Revision: yes

  2. Referee: [Bias analysis] the reported severe biases in high-performing models on audio quality, speaker gender, and language axes lack explicit controls for confounding variables (e.g., dataset size imbalances or interactions with model architecture), undermining the attribution of these biases as described in the abstract and results.

    Authors: We agree that explicit controls for confounders strengthen the bias analysis. Our original evaluation selected high-performing models and observed biases across multiple architectures, but did not fully adjust for dataset imbalances or interactions. In the revision, we will add controls in the Bias analysis section, including regression adjustments for dataset size and architecture interactions, stratified reporting where possible, and updated figures/tables showing bias metrics with these controls applied. The abstract and results will be revised to reflect the controlled findings.

    Revision: yes
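
The variance decomposition promised in the first response is not specified further. One simple form, sketched here under the assumption of a complete and balanced grid of front-ends and training sets, is eta-squared: each factor's sum of squares divided by the total sum of squares. The function name `variance_shares`, the factor labels, and the EER values are all hypothetical and chosen only to make the arithmetic easy to follow.

```python
from statistics import fmean


def variance_shares(eer):
    """eer: {(front_end, train_set): EER} on a complete, balanced grid.

    Returns the share of the total sum of squares (eta-squared) attributable
    to each main effect and to the remainder (interaction plus noise).
    """
    fronts = sorted({f for f, _ in eer})
    trains = sorted({t for _, t in eer})
    grand = fmean(eer.values())
    front_mean = {f: fmean(eer[f, t] for t in trains) for f in fronts}
    train_mean = {t: fmean(eer[f, t] for f in fronts) for t in trains}

    ss_total = sum((v - grand) ** 2 for v in eer.values())
    ss_front = len(trains) * sum((m - grand) ** 2 for m in front_mean.values())
    ss_train = len(fronts) * sum((m - grand) ** 2 for m in train_mean.values())
    ss_rest = ss_total - ss_front - ss_train
    return {
        "front_end": ss_front / ss_total,
        "train_set": ss_train / ss_total,
        "residual": ss_rest / ss_total,
    }


# Hypothetical EERs (%): two front-ends crossed with two training sets.
grid = {
    ("fe_a", "set_x"): 10.0, ("fe_a", "set_y"): 12.0,
    ("fe_b", "set_x"): 30.0, ("fe_b", "set_y"): 32.0,
}
shares = variance_shares(grid)
print(shares)
```

In this toy grid the front-end accounts for 400/404 of the total sum of squares. A real analysis would also need back-end as a factor, replicate runs to separate interaction from noise, and the significance tests the authors mention.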

Circularity Check

0 steps flagged

No significant circularity: empirical benchmarking on external data

Full rationale

The paper presents an open-source toolkit and reports direct performance measurements from evaluating more than 400 models across curated external datasets. Claims that front-end choice dominates variance and that high-performing models exhibit biases on quality/gender/language axes are observational results from this sweep, not outputs of any internal equations, fitted parameters renamed as predictions, or self-citation chains. No derivation steps exist that could reduce reported findings to inputs defined inside the paper; the work is self-contained against external benchmarks and explicitly falsifiable via the released recipes and code.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claims rest on standard neural architectures, loss functions, and public datasets drawn from prior literature; no new free parameters, axioms, or invented entities are introduced to support the reported findings.

pith-pipeline@v0.9.0 · 5483 in / 1101 out tokens · 55812 ms · 2026-05-10T16:54:47.561806+00:00 · methodology


Reference graph

Works this paper leans on

56 extracted references · 56 canonical work pages · 1 internal anchor

  1. [1]

    DeepFense: A Unified, Modular, and Extensible Framework for Robust Deepfake Audio Detection

    Introduction. While speech synthesis technologies are essential in speech-based human-human and human-machine interaction, they are also being misused to forge speech deepfakes that threaten voice biometric systems [1] as well as human listeners [2]. The research community has increasingly devoted itself to the robust detection of speech deepfakes (and ...

  2. [2]

    orchestrator

    DeepFense Framework. 2.1. Design principle. As Figure 1 illustrates, DeepFense implements a modular architecture that separates experimental specification, data processing, model training, execution, and evaluation. The complete training-evaluation setup is specified in a single human-readable file. These loosely coupled and highly modular components...

  3. [3]

    Replicating SOTA Results. To validate the correctness and reliability of the DeepFense framework, we assess its ability to replicate SOTA results from the literature. Specifically, we re-implement and evaluate several published systems under controlled conditions using DeepFense’s unified pipeline, ensuring identical preprocessing, optimizer configur...

  4. [4]

    A key motivation behind DeepFense is to enable system- [Table 3: EER (%) for EAT trained on EnvSDD and evaluated on CodecFake-A3 test set]

    Large-scale Comparison. Prior studies typically evaluate a single model or a small set of architectures on one or two datasets, making it difficult to disentangle the contributions of the front-end, back-end, and training data from implementation-specific choices [4], [5]. A key motivation behind DeepFense is to enable system- [Table 3: EER (%) for EAT trained...

  5. [5]

    For deepfake detection systems to be used in real-world applications, ensuring equitable protection is paramount

    Fairness Study. Standard evaluation metrics like EER show overall performance, but they may hide biases that affect certain user groups or conditions. For deepfake detection systems to be used in real-world applications, ensuring equitable protection is paramount. In this section, we investigate the fairness of our trained models across three vital di...

  6. [6]

    First, the choice of front-end turned out to be critical

    Discussion. Here we summarize the findings from the experimental evaluation with around 100 distinct systems built using the DeepFense toolkit. First, the choice of front-end turned out to be critical. Across the 13 evaluation datasets, Wav2Vec2 performed the best, achieving a macro-average EER of 25.5%. However, the front-end superiority is domain-depende...

  7. [7]

    Conclusion. In this work, we presented DeepFense, a PyTorch-based toolkit for speech deepfake detection with a configuration-driven design that lowers the barrier to reproducible experimentation. Across a large-scale evaluation spanning 13 datasets and three fairness axes, two factors consistently dominate both performance and group-level equity: front...

  8. [8]

    ASVspoof: The automatic speaker verification spoofing and countermeasures challenge

    Z. Wu, J. Yamagishi, T. Kinnunen, C. Hanilçi, M. Sahidullah, A. Sizov, N. Evans, M. Todisco, and H. Delgado, “ASVspoof: The automatic speaker verification spoofing and countermeasures challenge,” IEEE Journal of Selected Topics in Signal Processing, vol. 11, no. 4, pp. 588–604, 2017

  9. [9]

    “Better be computer or I’m dumb”: A large-scale evaluation of humans as audio deepfake detectors

    K. Warren, T. Tucker, A. Crowder, D. Olszewski, A. Lu, C. Fedele, M. Pasternak, S. Layton, K. Butler, C. Gates, et al., “‘Better be computer or I’m dumb’: A large-scale evaluation of humans as audio deepfake detectors,” in Proc. ACM CCS, 2024, pp. 2696–2710

  10. [10]

    End-to-end anti-spoofing with RawNet2

    H. Tak, J. Patino, M. Todisco, A. Nautsch, N. Evans, and A. Larcher, “End-to-end anti-spoofing with RawNet2,” in Proc. ICASSP, 2020, pp. 6369–6373

  11. [11]

    AASIST: Audio anti-spoofing using integrated spectro-temporal graph attention networks

    J.-w. Jung, H.-S. Heo, H. Tak, H.-j. Shim, J. S. Chung, B.-J. Lee, H.-J. Yu, and N. Evans, “AASIST: Audio anti-spoofing using integrated spectro-temporal graph attention networks,” in Proc. ICASSP, IEEE, 2022, pp. 6367–6371

  12. [12]

    Automatic speaker verification spoofing and deepfake detection using wav2vec 2.0 and data augmentation

    H. Tak, M. Todisco, X. Wang, J.-w. Jung, J. Yamagishi, and N. Evans, “Automatic speaker verification spoofing and deepfake detection using wav2vec 2.0 and data augmentation,” in Proc. Odyssey, 2022, pp. 112–119

  13. [13]

    BiCrossMamba-ST: Speech Deepfake Detection with Bidirectional Mamba Spectro-Temporal Cross-Attention

    Y. El Kheir, T. Polzehl, and S. Möller, “BiCrossMamba-ST: Speech Deepfake Detection with Bidirectional Mamba Spectro-Temporal Cross-Attention,” in Proc. Interspeech, 2025, pp. 2235–2239

  14. [14]

    ASVspoof 2019: A large-scale public database of synthesized, converted and replayed speech

    X. Wang et al., “ASVspoof 2019: A large-scale public database of synthesized, converted and replayed speech,” Computer Speech & Language, vol. 64, p. 101114, 2020

  15. [15]

    ADD 2023: The Second Audio Deepfake Detection Challenge

    J. Yi et al., “ADD 2023: The Second Audio Deepfake Detection Challenge,” in Proc. IJCAI Workshop on Deepfake Audio Detection and Analysis, 2023

  16. [16]

    ASVspoof 2021: Towards Spoofed and Deepfake Speech Detection in the Wild

    X. Liu et al., “ASVspoof 2021: Towards Spoofed and Deepfake Speech Detection in the Wild,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 31, pp. 2507–2522, 2023

  17. [17]

    RawBoost: A raw data boosting and augmentation method applied to automatic speaker verification anti-spoofing

    H. Tak, M. R. Kamble, J. Patino, M. Todisco, and N. W. D. Evans, “RawBoost: A raw data boosting and augmentation method applied to automatic speaker verification anti-spoofing,” in Proc. ICASSP, 2022, pp. 6382–6386

  18. [18]

    One-Class Learning Towards Synthetic Voice Spoofing Detection

    Y. Zhang, F. Jiang, and Z. Duan, “One-Class Learning Towards Synthetic Voice Spoofing Detection,” IEEE Signal Processing Letters, vol. 28, pp. 937–941, 2021

  19. [19]

    Wav2vec 2.0: A framework for self-supervised learning of speech representations

    A. Baevski, Y. Zhou, A. Mohamed, and M. Auli, “Wav2vec 2.0: A framework for self-supervised learning of speech representations,” in Proc. NeurIPS, vol. 33, 2020, pp. 12449–12460

  20. [20]

    Fairseq: A Fast, Extensible Toolkit for Sequence Modeling

    M. Ott, S. Edunov, A. Baevski, A. Fan, S. Gross, N. Ng, D. Grangier, and M. Auli, “Fairseq: A Fast, Extensible Toolkit for Sequence Modeling,” in Proc. NAACL, W. Ammar, A. Louis, and N. Mostafazadeh, Eds., 2019, pp. 48–53

  21. [21]

    Transformers: State-of-the-Art Natural Language Processing

    T. Wolf et al., “Transformers: State-of-the-Art Natural Language Processing,” in Proc. EMNLP, Online: Association for Computational Linguistics, 2020, pp. 38–45

  22. [22]

    Speech df arena: A leaderboard for speech deepfake detection models

    S. Dowerah, A. Kulkarni, A. Kulkarni, H. M. Tran, J. Kalda, A. Fedorchenko, B. Fauve, D. Lolive, T. Alumäe, and M. Magimai.-Doss, “Speech df arena: A leaderboard for speech deepfake detection models,” IEEE Open Journal of Signal Processing, vol. 7, pp. 73–81, 2026

  23. [23]

    Wedefense: A toolkit to defend against fake audio

    L. Zhang et al., Wedefense: A toolkit to defend against fake audio, 2026

  24. [24]

    SpeechBrain: A general-purpose speech toolkit

    M. Ravanelli et al., SpeechBrain: A general-purpose speech toolkit, arXiv:2106.04624, 2021

  25. [25]

    Post-training for deepfake speech detection

    W. Ge, X. Wang, X. Liu, and J. Yamagishi, “Post-training for deepfake speech detection,” in Proc. ASRU, 2025

  26. [26]

    WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing

    S. Chen et al., “WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing,” IEEE Journal of Selected Topics in Signal Processing, vol. 16, no. 6, pp. 1505–1518, 2022

  27. [27]

    HuBERT: Self-supervised speech representation learning by masked prediction of hidden units

    W.-N. Hsu, B. Bolte, Y.-H. H. Tsai, K. Lakhotia, R. Salakhutdinov, and A. Mohamed, “HuBERT: Self-supervised speech representation learning by masked prediction of hidden units,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 29, pp. 3451–3460, 2021

  28. [28]

    EAT: Self-supervised pre-training with efficient audio transformer

    W. Chen, Y. Liang, Z. Ma, Z. Zheng, and X. Chen, “EAT: Self-supervised pre-training with efficient audio transformer,” in Proc. IJCAI, 2024, pp. 3807–3815

  29. [29]

    MERT: Acoustic music understanding model with large-scale self-supervised training

    Y. Li et al., “MERT: Acoustic music understanding model with large-scale self-supervised training,” in Proc. ICLR, 2024

  30. [30]

    Robust speech recognition via large-scale weak supervision

    A. Radford, J. W. Kim, T. Xu, G. Brockman, C. McLeavey, and I. Sutskever, “Robust speech recognition via large-scale weak supervision,” in International Conference on Machine Learning, PMLR, 2023, pp. 28492–28518

  31. [31]

    BEATs: Audio pre-training with acoustic tokenizers

    S. Chen, Y. Wu, C. Wang, S. Liu, D. Tompkins, Z. Chen, W. Che, X. Yu, and F. Wei, “BEATs: Audio pre-training with acoustic tokenizers,” in Proc. ICML, A. Krause, E. Brunskill, K. Cho, B. Engelhardt, S. Sabato, and J. Scarlett, Eds., ser. Proceedings of Machine Learning Research, vol. 202, PMLR, 2023, pp. 5178–5193

  32. [32]

    W2v-BERT: Combining contrastive learning and masked language modeling for self-supervised speech pre-training

    Y.-A. Chung, Y. Zhang, W. Han, C.-C. Chiu, J. Qin, R. Pang, and Y. Wu, “W2v-BERT: Combining contrastive learning and masked language modeling for self-supervised speech pre-training,” in Proc. ASRU, IEEE, 2021, pp. 244–250

  33. [33]

    ECAPA-TDNN: Emphasized channel attention, propagation and aggregation in TDNN based speaker verification

    B. Desplanques, J. Thienpondt, and K. Demuynck, “ECAPA-TDNN: Emphasized channel attention, propagation and aggregation in TDNN based speaker verification,” in Proc. Interspeech, 2020, pp. 3830–3834

  34. [34]

    Nes2Net: A Lightweight Nested Architecture for Foundation Model Driven Speech Anti-spoofing

    T. Liu, D.-T. Truong, R. K. Das, K. A. Lee, and H. Li, “Nes2Net: A Lightweight Nested Architecture for Foundation Model Driven Speech Anti-spoofing,” IEEE Transactions on Information Forensics and Security, 2025

  35. [35]

    Temporal-Channel Modeling in Multi-head Self-Attention for Synthetic Speech Detection

    D.-T. Truong, R. Tao, T. Nguyen, H.-T. Luong, K. A. Lee, and E. S. Chng, “Temporal-Channel Modeling in Multi-head Self-Attention for Synthetic Speech Detection,” in Proc. Interspeech, ISCA, 2024, pp. 537–541

  36. [36]

    Comprehensive layer-wise analysis of ssl models for audio deepfake detection

    Y. El Kheir, Y. Samih, S. Maharjan, T. Polzehl, and S. Möller, “Comprehensive layer-wise analysis of ssl models for audio deepfake detection,” in Findings of the Association for Computational Linguistics: NAACL 2025, 2025, pp. 4070–4082

  37. [37]

    Additive margin softmax for face verification

    F. Wang, J. Cheng, W. Liu, and H. Liu, “Additive margin softmax for face verification,” IEEE Signal Processing Letters, vol. 25, no. 7, pp. 926–930, 2018

  38. [38]

    SphereFace: Deep Hypersphere Embedding for Face Recognition

    W. Liu, Y. Wen, Z. Yu, M. Li, B. Raj, and L. Song, “SphereFace: Deep Hypersphere Embedding for Face Recognition,” in Proc. CVPR, Honolulu, HI: IEEE, 2017, pp. 6738–6746

  39. [39]

    Harder or different? Understanding generalization of audio deepfake detection

    N. M. Müller, N. Evans, H. Tak, P. Sperl, and K. Böttinger, “Harder or different? Understanding generalization of audio deepfake detection,” in Proc. Interspeech, 2024, pp. 2705–2709

  40. [40]

    MLAAD: The Multi-Language Audio Anti-Spoofing Dataset

    N. M. Müller, P. Kawa, W. H. Choong, E. Casanova, E. Gölge, T. Müller, P. Syga, P. Sperl, and K. Böttinger, “MLAAD: The Multi-Language Audio Anti-Spoofing Dataset,” in Proc. IJCNN, Yokohama, Japan: IEEE, 2024, pp. 1–7

  41. [41]

    CodecFake: Enhancing Anti-Spoofing Models Against Deepfake Audios from Codec-Based Speech Synthesis Systems

    H. Wu, Y. Tseng, and H.-y. Lee, “CodecFake: Enhancing Anti-Spoofing Models Against Deepfake Audios from Codec-Based Speech Synthesis Systems,” in Proc. Interspeech, 2024, pp. 1770–1774

  42. [42]

    HABLA: A dataset of latin american spanish accents for voice anti-spoofing

    P. A. T. Flórez, R. Manrique, and B. P. Nunes, “HABLA: A dataset of latin american spanish accents for voice anti-spoofing,” in Proc. Interspeech, 2023, pp. 1963–1967

  43. [43]

    The PartialSpoof Database and Countermeasures for the Detection of Short Fake Speech Segments Embedded in an Utterance

    L. Zhang, X. Wang, E. Cooper, N. Evans, and J. Yamagishi, “The PartialSpoof Database and Countermeasures for the Detection of Short Fake Speech Segments Embedded in an Utterance,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, pp. 1–13, 2022

  44. [44]

    An open dataset of synthetic speech

    A. Yaroshchuk, C. Papastergiopoulos, L. Cuccovillo, P. Aichroth, K. Votis, and D. Tzovaras, “An open dataset of synthetic speech,” in Proc. WIFS, IEEE, 2023, pp. 1–6

  45. [45]

    Replay Attacks Against Audio Deepfake Detection

    N. Müller, P. Kawa, W.-H. Choong, A. Stan, A. T. Bukkapatnam, K. Pizzi, A. Wagner, and P. Sperl, “Replay Attacks Against Audio Deepfake Detection,” in Proc. Interspeech, 2025, pp. 2245–2249

  46. [46]

    Esdd 2026: Environmental sound deepfake detection challenge evaluation plan

    H. Yin, Y. Xiao, R. K. Das, J. Bai, and T. Dang, “Esdd 2026: Environmental sound deepfake detection challenge evaluation plan,” arXiv preprint arXiv:2508.04529, 2025

  47. [47]

    Investigating Self-Supervised Front Ends for Speech Spoofing Countermeasures

    X. Wang and J. Yamagishi, “Investigating Self-Supervised Front Ends for Speech Spoofing Countermeasures,” in Proc. Odyssey, 2022, pp. 100–106

  48. [48]

    Exploiting the ASR n-best by tracking multiple dialog state hypotheses

    J. D. Williams, “Exploiting the ASR n-best by tracking multiple dialog state hypotheses,” in Proc. Interspeech, 2008, pp. 191–194

  49. [49]

    Compspoof: A dataset and joint learning framework for component-level audio anti-spoofing countermeasures

    X. Zhang, Y. Wang, L. Li, L. Jin, and M. Li, “Compspoof: A dataset and joint learning framework for component-level audio anti-spoofing countermeasures,” in Proc. ICASSP, 2026

  50. [50]

    Fakemusiccaps: A dataset for detection and attribution of synthetic music generated via text-to-music models

    L. Comanducci, P. Bestagini, and S. Tubaro, “Fakemusiccaps: A dataset for detection and attribution of synthetic music generated via text-to-music models,” Journal of Imaging, vol. 11, no. 7, p. 242, 2025

  51. [51]

    CtrSVDD: A Benchmark Dataset and Baseline Analysis for Controlled Singing Voice Deepfake Detection

    Y. Zang et al., “CtrSVDD: A Benchmark Dataset and Baseline Analysis for Controlled Singing Voice Deepfake Detection,” in Proc. Interspeech, 2024, pp. 4783–4787

  52. [52]

    A Comparison of Differential Performance Metrics for the Evaluation of Automatic Speaker Verification Fairness

    O. Chouchane, C. Busch, C. Galdi, N. Evans, and M. Todisco, “A Comparison of Differential Performance Metrics for the Evaluation of Automatic Speaker Verification Fairness,” in Proc. Odyssey, 2024, pp. 209–216

  53. [53]

    Torchaudio-Squim: Reference-Less Speech Quality and Intelligibility Measures in Torchaudio

    A. Kumar, K. Tan, Z. Ni, P. Manocha, X. Zhang, E. Henderson, and B. Xu, “Torchaudio-Squim: Reference-Less Speech Quality and Intelligibility Measures in Torchaudio,” in Proc. ICASSP, Rhodes Island, Greece: IEEE, 2023, pp. 1–5

  54. [54]

    NISQA: A Deep CNN-Self-Attention Model for Multidimensional Speech Quality Prediction with Crowdsourced Datasets

    G. Mittag, B. Naderi, A. Chehadi, and S. Möller, “NISQA: A Deep CNN-Self-Attention Model for Multidimensional Speech Quality Prediction with Crowdsourced Datasets,” in Proc. Interspeech, 2021, pp. 2127–2131

  55. [55]

    High fidelity neural audio compression

    A. Défossez, J. Copet, G. Synnaeve, and Y. Adi, “High fidelity neural audio compression,” Transactions on Machine Learning Research, 2023

  56. [56]

    SoundStream: An End-to-End Neural Audio Codec

    N. Zeghidour, A. Luebs, A. Omran, J. Skoglund, and M. Tagliasacchi, “SoundStream: An End-to-End Neural Audio Codec,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 30, pp. 495–507, 2022