DeePen: Penetration Testing for Audio Deepfake Detection

Adriana Stan; Konstantin Böttinger; Nicolas Müller; Philip Sperl; Piotr Kawa; Souhwan Jung; Thien-Phuc Doan; Wei Herng Choong

arXiv: 2502.20427 · v3 · pith:6PIAPJVWnew · submitted 2025-02-27 · cs.CR · cs.AI · cs.SD · eess.AS

Nicolas Müller, Piotr Kawa, Adriana Stan, Thien-Phuc Doan, Souhwan Jung, Wei Herng Choong, Philip Sperl, Konstantin Böttinger

Pith reviewed 2026-05-23 02:40 UTC · model grok-4.3

classification: cs.CR · cs.AI · cs.SD · eess.AS

keywords: deepfake detection, penetration testing, audio security, black-box attacks, signal processing, machine learning robustness

The pith

Audio deepfake detectors can be deceived by simple signal processing attacks like time-stretching and echo addition.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents DeePen, a methodology for testing the robustness of audio deepfake detection models in a black-box manner using signal processing modifications. It shows that both real-world and academic detection systems are vulnerable to these attacks, allowing reliable deception without model access. This is important because deepfake audio poses security risks, and current detection methods appear insufficiently robust. Some vulnerabilities can be addressed through retraining on specific attacks, yet others persist across retraining attempts.

Core claim

Using DeePen, we analyze both real-world production systems and publicly available academic model checkpoints, demonstrating that all tested systems exhibit weaknesses and can be reliably deceived by simple manipulations such as time-stretching or echo addition. Furthermore, our findings reveal that while some attacks can be mitigated by retraining detection systems with knowledge of the specific attack, others remain persistently effective.

What carries the argument

DeePen, a black-box penetration testing approach that applies a set of signal processing attacks to probe vulnerabilities in deepfake detectors without prior knowledge of the models.
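
As a rough illustration of what such signal processing attacks look like, the sketch below implements echo addition and a naive time-stretch in plain Python. It is not the authors' implementation: the function names, the `delay_s` and `decay` defaults, and the use of linear-interpolation resampling (which also shifts pitch, unlike a phase-vocoder stretch) are assumptions made for illustration.

```python
def add_echo(samples, sample_rate, delay_s=0.25, decay=0.4):
    """Mix a delayed, attenuated copy of the signal into itself.

    delay_s and decay are illustrative values, not taken from the paper.
    """
    delay = int(round(delay_s * sample_rate))
    # Pad the output so the echo tail is not cut off.
    out = list(samples) + [0.0] * delay
    for i, s in enumerate(samples):
        out[i + delay] += decay * s
    return out


def stretch_by_resampling(samples, rate):
    """Change duration by linear interpolation; rate > 1 shortens the audio.

    Assumption: this naive version also changes pitch. A pitch-preserving
    stretch would need a phase vocoder or a similar technique.
    """
    if rate <= 0:
        raise ValueError("rate must be positive")
    n_out = max(1, int(round(len(samples) / rate)))
    last = len(samples) - 1
    out = []
    for k in range(n_out):
        pos = k * rate
        i = int(pos)
        frac = pos - i
        a = samples[min(i, last)]
        b = samples[min(i + 1, last)]
        out.append((1 - frac) * a + frac * b)
    return out
```

Both functions touch only the waveform, which is why attacks of this kind need no access to the detector under test.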

If this is right

All tested deepfake detection systems can be reliably deceived by basic manipulations such as time-stretching or echo addition.
Retraining detection systems with knowledge of specific attacks can mitigate some vulnerabilities but not others.
Production systems and academic models alike show these weaknesses under black-box testing.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Detectors may require ongoing adaptation to new attack vectors beyond the tested set.
Black-box testing like this could become standard for evaluating security of media detection tools.
Alternative detection strategies, such as those based on different features, might be needed to achieve robustness.

Load-bearing premise

The carefully selected set of signal processing modifications is sufficient to expose meaningful vulnerabilities in deepfake detection models in a black-box setting.

What would settle it

A deepfake detector that maintains high accuracy on audio modified by all the DeePen attacks would disprove the universal vulnerability claim.
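
One way to phrase this test concretely is sketched below. Everything here is hypothetical scaffolding: `detector` stands for any callable that maps a waveform to `True` (fake) or `False` (real), the 0.9 threshold in `is_robust` is an arbitrary placeholder, and the toy detector and attenuation attack are stand-ins that do not come from the paper.

```python
def accuracy_under_attacks(detector, labelled_clips, attacks):
    """Return {attack_name: accuracy}, using only the detector's outputs.

    labelled_clips: list of (samples, is_fake) pairs.
    attacks: dict mapping a name to a function samples -> samples.
    """
    results = {}
    for name, attack in attacks.items():
        correct = 0
        for samples, is_fake in labelled_clips:
            if detector(attack(samples)) == is_fake:
                correct += 1
        results[name] = correct / len(labelled_clips)
    return results


def is_robust(results, min_accuracy=0.9):
    """True only if accuracy stays above the threshold for every attack."""
    return all(acc >= min_accuracy for acc in results.values())


# Toy stand-ins (assumptions for the demo, not real detectors or attacks).
def toy_detector(samples):
    # Flags a clip as fake when its mean absolute amplitude exceeds 0.5.
    return sum(abs(s) for s in samples) / len(samples) > 0.5


clips = [([1.0, 1.0], True), ([0.1, 0.1], False)]
attacks = {
    "identity": lambda s: list(s),
    "attenuate": lambda s: [0.4 * x for x in s],
}
results = accuracy_under_attacks(toy_detector, clips, attacks)
robust = is_robust(results)
```

A single attack that pushes accuracy below the threshold is enough for `is_robust` to return `False`, which mirrors the all-attacks condition stated above.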

Figures

Figures reproduced from arXiv: 2502.20427 by Adriana Stan, Konstantin Böttinger, Nicolas Müller, Philip Sperl, Piotr Kawa, Souhwan Jung, Thien-Phuc Doan, Wei Herng Choong.

**Figure 1.** Application of DeePen methodology to an existing anti-spoofing dataset. Given a dataset such as ASVspoof 2019 or MLAAD, we extract …

Original abstract

Deepfakes - manipulated or forged audio and video media - pose significant security risks to individuals, organizations, and society at large. To address these challenges, machine learning-based classifiers are commonly employed to detect deepfake content. In this paper, we assess the robustness of such classifiers through a systematic penetration testing methodology, which we introduce as DeePen. Our approach operates without prior knowledge of or access to the target deepfake detection models. Instead, it leverages a set of carefully selected signal processing modifications - referred to as attacks - to evaluate model vulnerabilities. Using DeePen, we analyze both real-world production systems and publicly available academic model checkpoints, demonstrating that all tested systems exhibit weaknesses and can be reliably deceived by simple manipulations such as time-stretching or echo addition. Furthermore, our findings reveal that while some attacks can be mitigated by retraining detection systems with knowledge of the specific attack, others remain persistently effective.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

DeePen introduces a black-box testing method for audio deepfake detectors and shows vulnerabilities to basic signal tweaks, but the attack selection and lack of quantitative details limit how far the robustness claims can be taken.

The main point here is that the paper defines DeePen as a black-box penetration testing approach for audio deepfake detectors, applies a fixed set of signal processing attacks such as time-stretching and echo, and reports that every system tested, both production services and public academic checkpoints, can be fooled. It also notes that retraining on some attacks improves resistance while others stay effective. That is the actual new piece: a structured way to probe these detectors without model access, focused on audio rather than the more common image or video cases. Testing real deployed systems adds practical weight that many lab-only papers lack. The split between mitigable and persistent attacks is a useful observation if the numbers back it up.

The soft spots sit in the attack selection and the evidence presented. The abstract and stress-test note both flag that the list of modifications needs to be justified as chosen without any target knowledge and as representative enough to support claims of general weakness. If the attacks overlap with common training augmentations, success against them does not prove the detectors are broadly fragile. The mitigation experiments require model access that the black-box phase explicitly rules out, so the paper needs to explain how that asymmetry is handled without circularity. No success rates, attack counts, or statistical details appear in the supplied abstract, which makes it hard to judge effect sizes.

This work is aimed at researchers building or auditing audio deepfake detectors. A reader interested in practical robustness testing will get value from the framework even if the current results need tighter validation. It deserves peer review because the question it raises matters and the methodology is straightforward to replicate or extend, provided the full experiments include clear selection criteria and numbers.

Referee Report

3 major / 2 minor

Summary. The paper introduces DeePen, a black-box penetration testing methodology for audio deepfake detection models. It applies a fixed set of signal-processing attacks (e.g., time-stretching, echo addition) without model access or prior knowledge, evaluates both production systems and academic checkpoints, and claims that all tested detectors are vulnerable; some attacks can be mitigated by retraining while others remain effective.

Significance. If the attack set is shown to be chosen independently and the empirical results are reproducible, the work would usefully document concrete weaknesses in deployed detectors and the limits of simple augmentation-based defenses.

major comments (3)

[Abstract / §4] Abstract and §4 (evaluation): no quantitative success rates, attack-selection criteria, or experimental protocol are stated, so it is impossible to assess whether the reported vulnerabilities are load-bearing or merely narrow sensitivities.
[§3] §3 (attack design): the claim that the signal-processing modifications were selected without knowledge of the target models must be supported by an explicit, a-priori list and justification; otherwise the black-box robustness conclusion is circular.
[§5] §5 (mitigation experiments): retraining with attack knowledge presupposes white-box access that is unavailable in the black-box phase; the asymmetry must be justified or the two phases cannot be compared directly.

minor comments (2)

[§4] Add a table listing the exact attacks, their parameters, and per-model success rates.
[§3] Clarify whether any of the listed attacks overlap with standard data-augmentation pipelines already used by the detectors.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed review and constructive comments. We respond to each major comment below and will make revisions to improve clarity, provide explicit details, and strengthen the presentation of our black-box methodology without altering the core claims.

Referee: [Abstract / §4] Abstract and §4 (evaluation): no quantitative success rates, attack-selection criteria, or experimental protocol are stated, so it is impossible to assess whether the reported vulnerabilities are load-bearing or merely narrow sensitivities.

Authors: We agree that the abstract is high-level and does not include specific quantitative success rates or attack-selection criteria. Section 4 reports results across production systems and academic checkpoints with attacks such as time-stretching and echo addition, but the experimental protocol and selection criteria can be stated more explicitly. We will revise the abstract to include representative quantitative success rates and expand §4 with a dedicated subsection on the experimental protocol and attack-selection criteria to enable assessment of the results' generality.

Revision: yes

Referee: [§3] §3 (attack design): the claim that the signal-processing modifications were selected without knowledge of the target models must be supported by an explicit, a-priori list and justification; otherwise the black-box robustness conclusion is circular.

Authors: The attacks were selected as standard, widely used signal-processing operations (time-stretching, echo addition, and similar) drawn from general audio processing literature, without reference to any target detector. To remove any ambiguity, we will add to the revised §3 an explicit enumerated list of all attacks together with a-priori justification based solely on their effects on audio signals, independent of the models later evaluated.

Revision: yes

Referee: [§5] §5 (mitigation experiments): retraining with attack knowledge presupposes white-box access that is unavailable in the black-box phase; the asymmetry must be justified or the two phases cannot be compared directly.

Authors: The black-box phase evaluates detectors with zero model access or knowledge. The mitigation experiments are a separate analysis that assumes only that the defender knows the attack type (not model internals) and can augment training data accordingly; this does not require white-box access. We will revise §5 to add an explicit justification of this distinction, clarifying that the two phases address different questions and are not intended to be compared under identical access assumptions.

Revision: partial
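
The defender-side assumption described in the last response (knowing the attack type and augmenting training data, without touching model internals) might look like the following sketch. The function name, the `fraction` parameter, and the seeded sampling are illustrative assumptions; the paper's actual retraining setup is not described on this page.

```python
import random


def augment_with_attack(labelled_clips, attack, fraction=0.5, seed=0):
    """Return the original clips plus attacked copies of a random subset.

    labelled_clips: list of (samples, is_fake) pairs.
    attack: function samples -> samples for the known attack type.
    fraction: share of clips that also get an attacked copy (assumed knob).
    Labels are kept unchanged: an attacked fake is still a fake.
    """
    rng = random.Random(seed)
    augmented = list(labelled_clips)
    for samples, is_fake in labelled_clips:
        if rng.random() < fraction:
            augmented.append((attack(samples), is_fake))
    return augmented
```

The augmented list would then be fed to whatever training procedure the defender already uses, which is the only point where model access is needed.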

Circularity Check

0 steps flagged

No circularity: empirical black-box testing with external targets

The paper introduces DeePen as a black-box penetration testing methodology that applies a fixed set of signal-processing modifications to evaluate existing deepfake detectors. No equations, fitted parameters, self-definitional loops, or load-bearing self-citations appear in the abstract or described claims. The central results consist of empirical observations on real-world production systems and public academic checkpoints; success or failure of the attacks is measured against those external models rather than being constructed from the paper's own inputs. The methodology is therefore self-contained as an evaluation procedure without reduction to its own fitted values or prior author results.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities are identifiable from the abstract; the work relies on empirical testing rather than theoretical derivations.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

RADAR Challenge 2026: Robust Audio Deepfake Recognition under Media Transformations
eess.AS · 2026-05 · unverdicted · novelty 5.0

The RADAR Challenge 2026 provides a multilingual benchmark for audio deepfake detection under media transformations and finds that robust performance remains an open problem.

RADAR Challenge 2026: Robust Audio Deepfake Recognition under Media Transformations
eess.AS · 2026-05 · unverdicted · novelty 4.0

RADAR Challenge 2026 describes a benchmark with over 100,000 multilingual utterances and media transformations for audio deepfake detection, reporting results from 22 teams that highlight ongoing robustness issues.


work page 2018
[67]

wav2vec 2.0: A framework for self-supervised learning of speech representations,

A. Baevski, Y . Zhou, A. Mohamed, and M. Auli, “wav2vec 2.0: A framework for self-supervised learning of speech representations,” Advances in neural information processing systems, vol. 33, pp. 12 449– 12 460, 2020

work page 2020
[68]

Adam: A Method for Stochastic Optimization

D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, Y . Bengio and Y . LeCun, Eds., 2015. [Online]. Available: http://arxiv.org/abs/1412.6980

work page internal anchor Pith review Pith/arXiv arXiv 2015
[69]

RETRAIN ALL

“X,” https://x.com/chai ste/status/1757717290865283282, (Accessed: 16.10.2024). attack→ Add Background Music Add Background Noise Amplitude Modulation Autotune Bit Depth Change Echo Equalization Freq Minus Freq Plus Gaussian Noise High Pass Filter Low Pass Filter MP3 Compression Pitch Shift Reverb Silence Injection Time Stretch No Attack Mean adaptive def...

work page arXiv 2024

[1] F. Biadsy, R. J. Weiss, P. J. Moreno, D. Kanevsky, and Y. Jia, “Parrotron: An End-to-End Speech-to-Speech Conversion Model and its Applications to Hearing-Impaired Speech and Speech Separation,” in Proc. Interspeech 2019, 2019, pp. 4115–4119

[2] Apple, “Apple introduces new features for cognitive accessibility, along with Live Speech, Personal Voice, and Point and Speak in Magnifier,” https://www.apple.com/newsroom/2023/05/apple-previews-live-speech-personal-voice-and-more-new-accessibility-features/, 2023, Accessed: 17.10.2024

[3] “How deepfake videos are used to spread disinformation - the new york times,” https://www.nytimes.com/2023/02/07/technology/artificial-intelligence-training-deepfake.html, (Accessed: 16.10.2024)

[4] “Explicit ai-generated images of taylor swift circulate; can she sue for defamation?” https://www.scbc-law.org/post/explicit-ai-generated-images-of-taylor-swift-circulate-can-she-sue-for-defamation, (Accessed: 16.10.2024)

[5] “Opinion - deepfake porn sites used her image. she’s fighting back. - the new york times,” https://www.nytimes.com/2024/04/08/opinion/deepfake-porn-tech.html, (Accessed: 16.10.2024)

[6] “A voice deepfake was used to scam a ceo out of $243,000,” https://www.forbes.com/sites/jessedamiani/2019/09/03/a-voice-deepfake-was-used-to-scam-a-ceo-out-of-243000/, (Accessed: 16.10.2024)

[7] “Finance worker pays out $25 million after video call with deepfake ‘chief financial officer’ - cnn,” https://edition.cnn.com/2024/02/04/asia/deepfake-cfo-scam-hong-kong-intl-hnk/index.html, (Accessed: 16.10.2024)

[8] “NSE CEO deepfake: NSE urges caution after fake videos of CEO Ashish Chauhan recommending stocks go viral - The Economic Times,” https://economictimes.indiatimes.com/markets/stocks/news/beware-of-deepfake-of-ceo-recommending-stocks-says-nse/articleshow/109189329.cms?from=mdr, (Accessed: 16.10.2024)

[9] “A deepfake video showing volodymyr zelenskyy surrendering worries experts : Npr,” https://www.npr.org/2022/03/16/1087062648/deepfake-video-zelenskyy-experts-war-manipulation-ukraine-russia, (Accessed: 16.10.2024)

[10] X. Wang, H. Delgado, H. Tak, J.-w. Jung, H.-j. Shim, M. Todisco, I. Kukanov, X. Liu, M. Sahidullah, T. Kinnunen, N. Evans, K. A. Lee, and J. Yamagishi, “ASVspoof 5: Crowdsourced speech data, deepfakes, and adversarial attacks at scale,” in ASVspoof Workshop 2024 (accepted), 2024

[11] J. Yi, J. Tao, R. Fu, X. Yan, C. Wang, T. Wang, C. Y. Zhang, X. Zhang, Y. Zhao, Y. Ren et al., “Add 2023: the second audio deepfake detection challenge,” IJCAI 2023 Workshop on Deepfake Audio Detection (DADA 2023), 2023

[12] Eleven Labs, “Create a replica of your voice that sounds just like you,” https://elevenlabs.io/voice-cloning, 2024, Accessed: 17.10.2024

[13] Respeecher, “AI Voice Cloning,” https://www.respeecher.com/ai-voice-cloning, 2024, Accessed: 17.10.2024

[14] Resemble AI, “AI Voice Cloning: Clone your Voice in Seconds,” https://www.resemble.ai/voice-cloning/, 2024, Accessed: 17.10.2024

[15] J.-w. Jung, H.-s. Heo, J.-h. Kim, H.-j. Shim, and H.-j. Yu, “Rawnet: Advanced end-to-end deep neural network using raw waveforms for text-independent speaker verification,” Proc. Interspeech, pp. 1268–1272, 2019

[16] J. Jung, H. Heo, H. Tak, H. Shim, J. Chung, B. Lee, H. Yu, and N. Evans, “Aasist: Audio anti-spoofing using integrated spectro-temporal graph attention networks,” in ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings, 2022, pp. 2405–2409

[17] W. Ge, J. Patino, M. Todisco, and N. Evans, “Raw Differentiable Architecture Search for Speech Deepfake and Spoofing Detection,” in Proc. 2021 Edition of the Automatic Speaker Verification and Spoofing Countermeasures Challenge, 2021, pp. 22–28

[18] X. Cheng, M. Xu, and T. F. Zheng, “Replay detection using cqt-based modified group delay feature and resnewt network in asvspoof 2019,” in 2019 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), 2019, pp. 540–545

[19] Z. Wu, E. S. Chng, and H. Li, “Detecting converted speech and natural speech for anti-spoofing attack in speaker recognition,” in Interspeech 2012, 2012, pp. 1700–1703

[20] X. Wang and J. Yamagishi, “A comparative study on recent neural spoofing countermeasures for synthetic speech detection,” in Interspeech 2021, 2021, pp. 4259–4263

[21] A. Tomilov, A. Svishchev, M. Volkova, A. Chirkovskiy, A. Kondratev, and G. Lavrentyeva, “Stc antispoofing systems for the asvspoof2021 challenge,” in 2021 Edition of the Automatic Speaker Verification and Spoofing Countermeasures Challenge, 2021, pp. 61–67

[22] H. Tak, J. Patino, M. Todisco, A. Nautsch, N. Evans, and A. Larcher, “End-to-end anti-spoofing with rawnet2,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2021, pp. 6369–6373

[23] O. Pascu, A. Stan, D. Oneata, E. Oneata, and H. Cucu, “Towards generalisable and calibrated audio deepfake detection with self-supervised representations,” in Interspeech 2024, 2024, pp. 4828–4832

[24] P. Kawa, M. Plata, M. Czuba, P. Szymański, and P. Syga, “Improved DeepFake Detection Using Whisper Features,” in Proc. INTERSPEECH 2023, 2023, pp. 4009–4013

[25] H. Wu, W. Guo, S. Peng, Z. Li, and J. Zhang, “Adapter learning from pre-trained model for robust spoof speech detection,” in Interspeech 2024, 2024, pp. 2095–2099

[26] S. Saha, M. Sahidullah, and S. Das, “Exploring green AI for audio deepfake detection,” CoRR, vol. abs/2403.14290, 2024

[27] H. Tak, M. Todisco, X. Wang, J. weon Jung, J. Yamagishi, and N. Evans, “Automatic Speaker Verification Spoofing and Deepfake Detection Using Wav2vec 2.0 and Data Augmentation,” in Proc. The Speaker and Language Recognition Workshop (Odyssey 2022), 2022, pp. 112–119

[28] N. M. Müller, P. Czempin, F. Dieckmann, A. Froghyar, and K. Böttinger, “Does audio deepfake detection generalize?” in Interspeech, 2022, pp. 2783–2787

[29] Y. Zhang, Z. Li, J. Lu, H. Hua, W. Wang, and P. Zhang, “The impact of silence on speech anti-spoofing,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 31, pp. 3374–3389, 2023

[30] N. M. Müller, F. Dieckmann, P. Czempin, R. Canals, and K. Böttinger, “Speech is silver, silence is golden: What do asvspoof-trained models really learn?” ArXiv, vol. abs/2106.12914, 2021. [Online]. Available: https://api.semanticscholar.org/CorpusID:235624055

[31] V. Negroni, D. Salvi, P. Bestagini, and S. Tubaro, “Analyzing the impact of splicing artifacts in partially fake speech signals,” arXiv preprint arXiv:2408.13784, 2024

[32] L. Wang, L. Yu, Y. Zhang, and H. Xie, “Generalizable speech spoofing detection against silence trimming with data augmentation and multi-task meta-learning,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 32, pp. 3296–3310, 2024

[33] J. Lu, Y. Zhang, Z. Li, Z. Shang, W. Wang, and P. Zhang, “Improving copy-synthesis anti-spoofing training method with rhythm and speaker perturbation,” in Interspeech 2024, 2024, pp. 512–516

[34] X. Wang and J. Yamagishi, “Can large-scale vocoded spoofed data improve speech spoofing countermeasure with a self-supervised front end?” Submitted to the 2024 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2024), 4 2024. [Online]. Available: http://arxiv.org/pdf/2309.06014v1

[35] J. M. Martín-Doñas, A. Álvarez, E. Rosello, A. M. Gomez, and A. M. Peinado, “Exploring self-supervised embeddings and synthetic data augmentation for robust audio deepfake detection,” in Interspeech 2024, 2024, pp. 2085–2089

[36] H. Tak, M. Kamble, J. Patino, M. Todisco, and N. Evans, “Rawboost: A raw data boosting and augmentation method applied to automatic speaker verification anti-spoofing,” in ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2022, pp. 6382–6386

[37] H. Tak, M. Todisco, X. Wang, J.-w. Jung, J. Yamagishi, and N. Evans, “Automatic speaker verification spoofing and deepfake detection using wav2vec 2.0 and data augmentation,” in The Speaker and Language Recognition Workshop, 2022

[38] A. H. Azeemi, I. A. Qazi, and A. A. Raza, “Self-supervised dataset pruning for efficient training in audio anti-spoofing,” in INTERSPEECH 2023, 2023, pp. 2773–2777

[39] W. Ge, X. Wang, J. Yamagishi, M. Todisco, and N. Evans, “Spoofing attack augmentation: Can differently-trained attack models improve generalisation?” in ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2024, pp. 12 531–12 535

[40] M. Panariello, W. Ge, H. Tak, M. Todisco, and N. Evans, “Malafide: a novel adversarial convolutive noise attack against deepfake and spoofing detection systems,” in INTERSPEECH 2023, 2023, pp. 2868–2872

[41] J. Liu, M. Zhang, J. Ke, and L. Wang, “Advshadow: Evading deepfake detection via adversarial shadow attack,” in ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2024, pp. 4640–4644

[42] M. Todisco, M. Panariello, X. Wang, H. Delgado, K.-A. Lee, and N. Evans, “Malacopula: adversarial automatic speaker verification attacks using a neural-based generalised hammerstein model,” in Proc. ASVspoof Workshop 2024, 2024

[43] M. Todisco, X. Wang, V. Vestman, M. Sahidullah, H. Delgado, A. Nautsch, J. Yamagishi, N. Evans, T. H. Kinnunen, and K. A. Lee, “ASVspoof 2019: Future Horizons in Spoofed and Fake Audio Detection,” in Proc. Interspeech 2019, 2019, pp. 1008–1012

[44] M. Alzantot, Z. Wang, and M. B. Srivastava, “Deep residual neural networks for audio spoofing detection,” in Interspeech 2019, 2019, pp. 1078–1082

[45] N. Müller, P. Czempin, F. Dieckmann, A. Froghyar, and K. Böttinger, “Does audio deepfake detection generalize?” in Interspeech 2022, 2022, pp. 2783–2787

[46] P. Kawa, M. Plata, and P. Syga, “Attack agnostic dataset: Towards generalization and stabilization of audio deepfake detection,” in Interspeech 2022, 2022, pp. 4023–4027

[47] H. Tak, J. weon Jung, J. Patino, M. Kamble, M. Todisco, and N. Evans, “End-to-end spectro-temporal graph attention networks for speaker verification anti-spoofing and speech deepfake detection,” in Proc. 2021 Edition of the Automatic Speaker Verification and Spoofing Countermeasures Challenge, 2021, pp. 1–8

[48] N. M. Müller, P. Sperl, and K. Böttinger, “Complex-valued neural networks for voice anti-spoofing,” in INTERSPEECH 2023, 2023, pp. 3814–3818

[49] J. Lu, Y. Zhang, W. Wang, Z. Shang, and P. Zhang, “One-class knowledge distillation for spoofing speech detection,” in ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2024

[50] S. Ding, Y. Zhang, and Z. Duan, “Samo: Speaker attractor multi-center one-class learning for voice anti-spoofing,” in ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2023, pp. 1–5

[51] N. M. Müller, P. Kawa, W. H. Choong, E. Casanova, E. Gölge, T. Müller, P. Syga, P. Sperl, and K. Böttinger, “Mlaad: The multi-language audio anti-spoofing dataset,” International Joint Conference on Neural Networks (IJCNN), 2024

[52] J. Ao, R. Wang, L. Zhou, C. Wang, S. Ren, Y. Wu, S. Liu, T. Ko, Q. Li, Y. Zhang, Z. Wei, Y. Qian, J. Li, and F. Wei, “SpeechT5: Unified-modal encoder-decoder pre-training for spoken language processing,” in Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Dublin, Ireland: Association for...

[53] E. Casanova, K. Davis, E. Gölge, G. Göknar, I. Gulea, L. Hart, A. Aljafari, J. Meyer, R. Morais, S. Olayemi, and J. Weber, “Xtts: a massively multilingual zero-shot text-to-speech model,” in Interspeech 2024, 2024, pp. 4978–4982

[54] J. Betker, “Better speech synthesis through scaling,” arXiv preprint arXiv:2305.07243, 2023

[55] D. Snyder, G. Chen, and D. Povey, “MUSAN: A Music, Speech, and Noise Corpus,” 2015, arXiv:1510.08484v1

[56] F. M. Archive, “Free music archive - instrumental,” https://freemusicarchive.org/genre/Instrumental/, 2024, accessed: 10.10.2024

[57] K. J. Piczak, “ESC: Dataset for Environmental Sound Classification,” in Proceedings of the 23rd Annual ACM Conference on Multimedia. ACM Press, 2015, pp. 1015–1018. [Online]. Available: http://dl.acm.org/citation.cfm?doid=2733373.2806390

[58] J. Wilczek, “Simple auto-tune in python,” https://github.com/JanWilczek/python-auto-tune, 2023, accessed: 10.10.2024

[59] J. Robert, “Pydub,” https://github.com/jiaaro/pydub, 2024, accessed: 10.10.2024

[60] B. McFee, C. Raffel, D. Liang, D. P. Ellis, M. McVicar, E. Battenberg, and O. Nieto, “librosa: Audio and music signal analysis in python,” in SciPy, 2015, pp. 18–24

[61] H. Liu, K. Simonyan, and Y. Yang, “DARTS: Differentiable architecture search,” in International Conference on Learning Representations, 2019. [Online]. Available: https://openreview.net/forum?id=S1eYHoC5FX

[62] M. Ravanelli and Y. Bengio, “Speaker recognition from raw waveform with sincnet,” in 2018 IEEE spoken language technology workshop (SLT). IEEE, 2018, pp. 1021–1028

[63] J. Chung, C. Gulcehre, K. Cho, and Y. Bengio, “Empirical evaluation of gated recurrent neural networks on sequence modeling,” arXiv preprint arXiv:1412.3555, 2014

[64] X. Zhou, D. Garcia-Romero, R. Duraiswami, C. Espy-Wilson, and S. Shamma, “Linear versus mel frequency cepstral coefficients for speaker recognition,” in 2011 IEEE workshop on automatic speech recognition & understanding. IEEE, 2011, pp. 559–564

[65] A. Radford, J. W. Kim, T. Xu, G. Brockman, C. McLeavey, and I. Sutskever, “Robust speech recognition via large-scale weak supervision,” in International conference on machine learning. PMLR, 2023, pp. 28 492–28 518

[66] D. Afchar, V. Nozick, J. Yamagishi, and I. Echizen, “MesoNet: a Compact Facial Video Forgery Detection Network,” in 2018 IEEE International Workshop on Information Forensics and Security (WIFS), 2018, pp. 1–7

[67] A. Baevski, Y. Zhou, A. Mohamed, and M. Auli, “wav2vec 2.0: A framework for self-supervised learning of speech representations,” Advances in neural information processing systems, vol. 33, pp. 12 449–12 460, 2020

[68] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, Y. Bengio and Y. LeCun, Eds., 2015. [Online]. Available: http://arxiv.org/abs/1412.6980

[69] “X,” https://x.com/chai ste/status/1757717290865283282, (Accessed: 16.10.2024)

Table (RETRAIN ALL), columns by attack: Add Background Music, Add Background Noise, Amplitude Modulation, Autotune, Bit Depth Change, Echo, Equalization, Freq Minus, Freq Plus, Gaussian Noise, High Pass Filter, Low Pass Filter, MP3 Compression, Pitch Shift, Reverb, Silence Injection, Time Stretch, No Attack, Mean. adaptive def...