
arxiv: 2502.20427 · v3 · pith:6PIAPJVWnew · submitted 2025-02-27 · 💻 cs.CR · cs.AI · cs.SD · eess.AS

DeePen: Penetration Testing for Audio Deepfake Detection

Pith reviewed 2026-05-23 02:40 UTC · model grok-4.3

classification 💻 cs.CR · cs.AI · cs.SD · eess.AS
keywords deepfake detection · penetration testing · audio security · black-box attacks · signal processing · machine learning robustness

The pith

Audio deepfake detectors can be deceived by simple signal processing attacks like time-stretching and echo addition.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents DeePen, a methodology for testing the robustness of audio deepfake detection models in a black-box manner using signal processing modifications. It shows that both real-world and academic detection systems are vulnerable to these attacks, allowing reliable deception without model access. This is important because deepfake audio poses security risks, and current detection methods appear insufficiently robust. Some vulnerabilities can be addressed through retraining on specific attacks, yet others persist across retraining attempts.

Core claim

Using DeePen, we analyze both real-world production systems and publicly available academic model checkpoints, demonstrating that all tested systems exhibit weaknesses and can be reliably deceived by simple manipulations such as time-stretching or echo addition. Furthermore, our findings reveal that while some attacks can be mitigated by retraining detection systems with knowledge of the specific attack, others remain persistently effective.
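To make the two manipulations named above concrete, here is a minimal sketch in plain Python. It is not the paper's implementation: the function names, the default `delay_s`, `decay`, and `rate` values, and the resampling method are all assumptions for illustration. In particular, `change_speed` is a naive resampler that alters pitch along with duration, whereas a true time-stretch (for example, a phase-vocoder approach) would preserve pitch.

```python
import math


def add_echo(samples, sample_rate, delay_s=0.2, decay=0.5):
    """Mix a delayed, attenuated copy of the signal into itself.

    delay_s and decay are illustrative defaults, not values from the paper.
    """
    delay = int(round(delay_s * sample_rate))
    out = list(samples) + [0.0] * delay
    for i, s in enumerate(samples):
        out[i + delay] += decay * s
    return out


def change_speed(samples, rate=1.1):
    """Naive speed change via linear-interpolation resampling.

    rate > 1 shortens the clip. Unlike a real time-stretch, this
    also shifts pitch; it only stands in for the idea.
    """
    n_out = int(len(samples) / rate)
    out = []
    for j in range(n_out):
        pos = j * rate
        i = int(pos)
        frac = pos - i
        nxt = samples[min(i + 1, len(samples) - 1)]
        out.append((1.0 - frac) * samples[i] + frac * nxt)
    return out


# One second of a 440 Hz tone at 16 kHz as stand-in audio.
sample_rate = 16000
tone = [math.sin(2 * math.pi * 440 * n / sample_rate) for n in range(sample_rate)]
echoed = add_echo(tone, sample_rate)
faster = change_speed(tone, rate=1.25)
```

Both functions need only the waveform, which is what makes manipulations of this kind usable without any access to the detector.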

What carries the argument

DeePen, a black-box penetration testing approach that applies a set of signal processing attacks to probe vulnerabilities in deepfake detectors without prior knowledge of the models.
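A black-box probe of this kind can be sketched as a loop that only ever calls the detector on audio and reads back a score. The sketch below is an assumption-laden illustration, not DeePen itself: `probe`, `toy_detector`, the 0.5 threshold, and the `halve_gain` manipulation are all hypothetical, and the success metric (fraction of previously caught fakes that evade detection after the attack) is one plausible choice rather than the paper's stated protocol.

```python
def probe(detector, fake_clips, attacks, threshold=0.5):
    """Return, per attack, the fraction of fake clips that evade detection.

    detector: callable mapping a clip to a fake-probability in [0, 1].
    Only clips the detector flags as fake before any attack are counted,
    so the rate measures what the attack changes. No model internals are used.
    """
    caught = [c for c in fake_clips if detector(c) >= threshold]
    results = {}
    for name, attack in attacks.items():
        evaded = sum(1 for c in caught if detector(attack(c)) < threshold)
        results[name] = evaded / len(caught) if caught else 0.0
    return results


def toy_detector(clip):
    # Hypothetical stand-in: scores a clip by its peak amplitude.
    return min(1.0, max(abs(s) for s in clip))


# Toy manipulations for illustration only; not taken from the paper.
attacks = {
    "halve_gain": lambda c: [0.5 * s for s in c],
    "identity": lambda c: list(c),
}
fake_clips = [[0.9, -0.8, 0.7], [0.6, -0.6, 0.2], [0.3, 0.1, -0.2]]
rates = probe(toy_detector, fake_clips, attacks)
```

Because `probe` treats the detector as an opaque callable, the same loop would apply to a local checkpoint or a remote production API wrapped in a function.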

If this is right

  • All tested deepfake detection systems can be reliably deceived by basic manipulations such as time-stretching or echo addition.
  • Retraining detection systems with knowledge of specific attacks can mitigate some vulnerabilities but not others.
  • Production systems and academic models alike show these weaknesses under black-box testing.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Detectors may require ongoing adaptation to new attack vectors beyond the tested set.
  • Black-box testing like this could become standard for evaluating security of media detection tools.
  • Alternative detection strategies, such as those based on different features, might be needed to achieve robustness.

Load-bearing premise

The carefully selected set of signal processing modifications is sufficient to expose meaningful vulnerabilities in deepfake detection models in a black-box setting.

What would settle it

A deepfake detector that maintains high accuracy on audio modified by all the DeePen attacks would disprove the universal vulnerability claim.
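The falsification test can be phrased as a pass/fail check over every attack. The following is a rough sketch under stated assumptions: "high accuracy" is not quantified in the text, so `min_accuracy=0.95` is an arbitrary placeholder, and `toy_detector`, the clips, and the two manipulations are hypothetical.

```python
def accuracy_under_attack(detector, labeled_clips, attack, threshold=0.5):
    """Accuracy on (clip, is_fake) pairs after applying one attack."""
    correct = 0
    for clip, is_fake in labeled_clips:
        predicted_fake = detector(attack(clip)) >= threshold
        correct += predicted_fake == is_fake
    return correct / len(labeled_clips)


def withstands_all(detector, labeled_clips, attacks, min_accuracy=0.95):
    """True only if accuracy stays at or above min_accuracy for every attack.

    min_accuracy is an assumed cut-off, not a value from the paper.
    """
    return all(
        accuracy_under_attack(detector, labeled_clips, attack) >= min_accuracy
        for attack in attacks.values()
    )


def toy_detector(clip):
    # Hypothetical stand-in: scores a clip by its peak amplitude.
    return min(1.0, max(abs(s) for s in clip))


labeled_clips = [([0.9, -0.8], True), ([0.1, 0.2], False)]
identity = lambda c: list(c)
halve_gain = lambda c: [0.5 * s for s in c]

robust_to_identity = withstands_all(toy_detector, labeled_clips, {"identity": identity})
robust_to_both = withstands_all(
    toy_detector, labeled_clips, {"identity": identity, "halve_gain": halve_gain}
)
```

A single failing attack is enough to make `withstands_all` return `False`, which mirrors the "all the DeePen attacks" wording of the criterion.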

Figures

Figures reproduced from arXiv: 2502.20427 by Adriana Stan, Konstantin Böttinger, Nicolas Müller, Philip Sperl, Piotr Kawa, Souhwan Jung, Thien-Phuc Doan, Wei Herng Choong.

Figure 1: Application of DeePen methodology to an existing anti-spoofing dataset. Given a dataset such as ASVspoof 2019 or MLAAD, we extract ...
Original abstract

Deepfakes - manipulated or forged audio and video media - pose significant security risks to individuals, organizations, and society at large. To address these challenges, machine learning-based classifiers are commonly employed to detect deepfake content. In this paper, we assess the robustness of such classifiers through a systematic penetration testing methodology, which we introduce as DeePen. Our approach operates without prior knowledge of or access to the target deepfake detection models. Instead, it leverages a set of carefully selected signal processing modifications - referred to as attacks - to evaluate model vulnerabilities. Using DeePen, we analyze both real-world production systems and publicly available academic model checkpoints, demonstrating that all tested systems exhibit weaknesses and can be reliably deceived by simple manipulations such as time-stretching or echo addition. Furthermore, our findings reveal that while some attacks can be mitigated by retraining detection systems with knowledge of the specific attack, others remain persistently effective.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces DeePen, a black-box penetration testing methodology for audio deepfake detection models. It applies a fixed set of signal-processing attacks (e.g., time-stretching, echo addition) without model access or prior knowledge, evaluates both production systems and academic checkpoints, and claims that all tested detectors are vulnerable; some attacks can be mitigated by retraining while others remain effective.

Significance. If the attack set is shown to be chosen independently and the empirical results are reproducible, the work would usefully document concrete weaknesses in deployed detectors and the limits of simple augmentation-based defenses.

major comments (3)
  1. [Abstract / §4] Abstract and §4 (evaluation): no quantitative success rates, attack-selection criteria, or experimental protocol are stated, so it is impossible to assess whether the reported vulnerabilities are load-bearing or merely narrow sensitivities.
  2. [§3] §3 (attack design): the claim that the signal-processing modifications were selected without knowledge of the target models must be supported by an explicit, a-priori list and justification; otherwise the black-box robustness conclusion is circular.
  3. [§5] §5 (mitigation experiments): retraining with attack knowledge presupposes white-box access that is unavailable in the black-box phase; the asymmetry must be justified or the two phases cannot be compared directly.
minor comments (2)
  1. [§4] Add a table listing the exact attacks, their parameters, and per-model success rates.
  2. [§3] Clarify whether any of the listed attacks overlap with standard data-augmentation pipelines already used by the detectors.
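The table requested in the first minor comment could take a shape like the one sketched below. Everything here is a placeholder: the model names, the parameter strings, and the 0.0 rates are hypothetical fillers that show the layout only, and none of them are results from the paper.

```python
def format_table(rows, models):
    """Render (attack, parameters, {model: success_rate}) rows as plain text."""
    header = ["attack", "parameters"] + models
    lines = [" | ".join(header)]
    for attack, params, rates in rows:
        cells = [attack, params] + [f"{rates[m]:.2f}" for m in models]
        lines.append(" | ".join(cells))
    return "\n".join(lines)


# Hypothetical model names and parameters; rates are 0.0 placeholders.
models = ["model_A", "model_B"]
rows = [
    ("time_stretch", "rate=1.1", {"model_A": 0.0, "model_B": 0.0}),
    ("echo", "delay=0.2s, decay=0.5", {"model_A": 0.0, "model_B": 0.0}),
]
table = format_table(rows, models)
print(table)
```

Keeping the parameters in their own column addresses the referee's point that an attack name alone does not say how strong the manipulation was.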

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed review and constructive comments. We respond to each major comment below and will make revisions to improve clarity, provide explicit details, and strengthen the presentation of our black-box methodology without altering the core claims.

Point-by-point responses
  1. Referee: [Abstract / §4] Abstract and §4 (evaluation): no quantitative success rates, attack-selection criteria, or experimental protocol are stated, so it is impossible to assess whether the reported vulnerabilities are load-bearing or merely narrow sensitivities.

    Authors: We agree that the abstract is high-level and does not include specific quantitative success rates or attack-selection criteria. Section 4 reports results across production systems and academic checkpoints with attacks such as time-stretching and echo addition, but the experimental protocol and selection criteria can be stated more explicitly. We will revise the abstract to include representative quantitative success rates and expand §4 with a dedicated subsection on the experimental protocol and attack-selection criteria to enable assessment of the results' generality. revision: yes

  2. Referee: [§3] §3 (attack design): the claim that the signal-processing modifications were selected without knowledge of the target models must be supported by an explicit, a-priori list and justification; otherwise the black-box robustness conclusion is circular.

    Authors: The attacks were selected as standard, widely used signal-processing operations (time-stretching, echo addition, and similar) drawn from general audio processing literature, without reference to any target detector. To remove any ambiguity, we will add to the revised §3 an explicit enumerated list of all attacks together with a-priori justification based solely on their effects on audio signals, independent of the models later evaluated. revision: yes

  3. Referee: [§5] §5 (mitigation experiments): retraining with attack knowledge presupposes white-box access that is unavailable in the black-box phase; the asymmetry must be justified or the two phases cannot be compared directly.

    Authors: The black-box phase evaluates detectors with zero model access or knowledge. The mitigation experiments are a separate analysis that assumes only that the defender knows the attack type (not model internals) and can augment training data accordingly; this does not require white-box access. We will revise §5 to add an explicit justification of this distinction, clarifying that the two phases address different questions and are not intended to be compared under identical access assumptions. revision: partial
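The distinction drawn in the third response, that the defender needs to know the attack type but not the model internals, can be illustrated with a data-augmentation step. This is a minimal sketch under assumptions: `augment`, the probability `p`, and the gain-halving manipulation are hypothetical, and the paper's actual retraining procedure is not described in this text.

```python
import random


def augment(clip, label, attacks, p=0.5, rng=random):
    """With probability p, apply one randomly chosen known attack to a clip.

    The label is left unchanged: an attacked fake is still a fake.
    Only the attack functions are needed, not the detector's internals.
    """
    if attacks and rng.random() < p:
        attack = rng.choice(list(attacks))
        return attack(clip), label
    return clip, label


# Hypothetical known manipulation used to augment the training data.
attacks = [lambda c: [0.5 * s for s in c]]
clip = [0.8, -0.4]

always, label_a = augment(clip, 1, attacks, p=1.0)
never, label_n = augment(clip, 1, attacks, p=0.0)
```

Passing a seeded `random.Random` instance as `rng` would make the augmentation reproducible across training runs.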

Circularity Check

0 steps flagged

No circularity: empirical black-box testing with external targets

Full rationale

The paper introduces DeePen as a black-box penetration testing methodology that applies a fixed set of signal-processing modifications to evaluate existing deepfake detectors. No equations, fitted parameters, self-definitional loops, or load-bearing self-citations appear in the abstract or described claims. The central results consist of empirical observations on real-world production systems and public academic checkpoints; success or failure of the attacks is measured against those external models rather than being constructed from the paper's own inputs. The methodology is therefore self-contained as an evaluation procedure without reduction to its own fitted values or prior author results.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities are identifiable from the abstract; the work relies on empirical testing rather than theoretical derivations.



Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. RADAR Challenge 2026: Robust Audio Deepfake Recognition under Media Transformations

    eess.AS 2026-05 unverdicted novelty 5.0

    The RADAR Challenge 2026 provides a multilingual benchmark for audio deepfake detection under media transformations and finds that robust performance remains an open problem.

  2. RADAR Challenge 2026: Robust Audio Deepfake Recognition under Media Transformations

    eess.AS 2026-05 unverdicted novelty 4.0

    RADAR Challenge 2026 describes a benchmark with over 100,000 multilingual utterances and media transformations for audio deepfake detection, reporting results from 22 teams that highlight ongoing robustness issues.
