pith. sign in

arxiv: 2606.22310 · v1 · pith:USJTXSP2new · submitted 2026-06-21 · 💻 cs.SD · cs.CR

Learning to Evade: Adaptive Attacks on Audio Watermarking

Pith reviewed 2026-06-26 10:08 UTC · model grok-4.3

classification 💻 cs.SD cs.CR
keywords audio watermarkingadversarial attacksadaptive attackswatermark evasionaudio copyrightgenerative audiodetection bypass
0
0 comments X

The pith

An adaptive attack steers audio watermark decoder probabilities into estimated normal ranges to evade detection.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents AWM, a method that attacks audio watermarks by first optimizing for success in replacing, creating, or removing marks and then refining audio quality. It exploits the fact that decoder message probabilities follow normal distributions, which existing detectors use to spot manipulations. By estimating those distribution parameters from a few samples of the target audio, AWM steers the attack outputs back inside the expected range. This matters because generative audio tools heighten the need for reliable ownership marks, yet the approach drops detection rates below 10 percent for replacement and creation and to zero for removal across tested methods and datasets.

Core claim

Watermark decoder message probabilities follow normal distributions, a property that can be estimated from limited samples of the target audio. AWM uses a two-stage optimization where the first stage ensures attack success on the watermark and the second improves audio quality, while adaptively steering decoded probabilities into the estimated normal range to bypass detectors.

What carries the argument

Two-stage optimization that estimates normal distribution parameters from limited target audio samples and steers decoded message probabilities back into the estimated range.

If this is right

  • AWM succeeds against two different watermarking methods on three voice datasets.
  • Detection rates fall below 10 percent for replacement and creation attacks.
  • Removal attacks achieve 0 percent detection rate.
  • The method maintains high attack success while producing usable audio output.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Watermark systems may need to adopt output distributions that resist quick estimation from small sample sets.
  • Detectors could add checks that do not rely solely on normality assumptions.
  • Real-world watermark deployment in generative audio tools may require ongoing adaptation against such steering attacks.

Load-bearing premise

Watermark decoder message probabilities follow normal distributions that can be estimated accurately from limited samples of the target audio.

What would settle it

An experiment showing that the attack no longer keeps detection rates low when the probability outputs deviate from normality or when accurate parameter estimates require far more than a few samples.

Figures

Figures reproduced from arXiv: 2606.22310 by Guangjing Wang, Hanqing Guo, Mingzhe Chen, Qiben Yan, Rui Duan, Weikang Ding, Yuanda Wang.

Figure 1
Figure 1. Figure 1: Overview of the watermark attack (left) and the water￾mark attack detection process used to detect whether the audio has been tampered with (right). ical concern [7]. Recent studies [8, 9] show that attackers can remove or forge watermarks by embedding carefully crafted ad￾versarial perturbations. These attacks operate by adding and optimizing a perturbation signal so that the watermark decoder is misled i… view at source ↗
Figure 2
Figure 2. Figure 2: (a), the message distribution under attack differs in the range of message probabilities along the x-axis, even though it still exhibits a unimodal normal distribution [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Distribution of message probabilities under attacks (Timbre). We use the AudioMarkBench to perform the attacks. sume that: 1) attackers have no access to the data used by de￾fenders to fit the distribution, nor to the audio dataset used to train the watermarking model; 2) they have no access to the ground-truth watermark message embedded in the target audio; but they can get the message probabilities from … view at source ↗
Figure 5
Figure 5. Figure 5: The distribution estimation by the attacker. (a) Water￾mark replacement and creation: The attacker uses a small set of clean audio samples to generate watermarked audio samples, which are then used to estimate the distribution. (b) Watermark removal: The attacker directly uses the clean audio samples to estimate the distribution. Under the null hypothesis, zi follows the standard normal distribution, and t… view at source ↗
Figure 6
Figure 6. Figure 6: The design of AWM generator. The audio watermark attack step (left) ensures the success of the watermark attack, while the audio quality optimization step (right) focuses on improving audio quality. Parameter Estimation for Watermark Removal. The attack￾ers estimate the distribution directly using clean audio sam￾ples sc. They use the watermark decoder to output message probabilities, which are used for di… view at source ↗
Figure 8
Figure 8. Figure 8: Message probabilities distribution comparisons be￾tween AudioMarkBench and AWM for the watermark creation. AudioMarkBench Ours Ours (+opt) [PITH_FULL_IMAGE:figures/full_fig_p007_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: The spectrograms of the watermark creation in Au￾dioSeal. The dotted red box shows noticeable noise for AWM attack. The green box indicates that AWM (+opt) attack achieves higher audio quality and is visually similar to Au￾dioMarkBench. the attacked audio: (1) low-pass filtering (LP): fixed at 5000Hz, (2) amplitude scaling (AS): fixed at scale of 0.9, (3) Gaussian noise (GN): fixed at 40 dB, (4) MP3 compre… view at source ↗
Figure 10
Figure 10. Figure 10: ASR after applying different no-box perturbations on the watermark creation attacks. It evaluates the ASR results of watermark creation against the no-box perturbations, which uses the Librispeech dataset for validation. of the attack variant “Ours (+opt)”, which indicates that improv￾ing the audio quality will increase the detection risk. Since the DSR of the attacked audio closely matches that of ground… view at source ↗
read the original abstract

Advances in generative audio have intensified copyright concerns, making audio watermarking increasingly important for asserting ownership. However, existing audio watermarking methods are vulnerable to adversarial attacks. We find that watermark decoder message probabilities follow normal distributions, a property exploited by defenses to detect manipulations. This paper introduces an adaptive audio watermark attack method (AWM) designed to bypass existing defense strategies. AWM uses a two-stage optimization: the first stage ensures attack success, while the second improves audio quality. To evade detection, it estimates normal distribution parameters from limited samples of the target audio, and then adaptively steers decoded probabilities back into the estimated range. Evaluated on two watermarking methods across three voice datasets, AWM achieves high success while bypassing state-of-the-art detectors: detection rates are below 10% for replacement and creation, and 0% for removal.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The manuscript presents an adaptive attack method called AWM for evading audio watermarking detectors. It exploits the observation that watermark decoder message probabilities follow normal distributions by estimating their parameters from limited target audio samples and steering the decoded probabilities into the estimated range during a two-stage optimization process. The first stage focuses on attack success, the second on audio quality. Evaluations on two watermarking methods and three voice datasets show high attack success with detection rates below 10% for replacement and creation attacks and 0% for removal attacks.

Significance. Should the central claims be substantiated, particularly the normality assumption and the empirical evasion results, the paper would make a notable contribution to the field of audio watermarking security by illustrating how adaptive attacks can circumvent existing detectors. This could prompt the development of more robust watermarking techniques that do not rely on distributional assumptions vulnerable to estimation from limited samples.

major comments (2)
  1. [Abstract] Abstract: The evasion strategy depends on the assumption that watermark decoder message probabilities follow normal distributions estimable from limited target samples; no statistical tests, Q-Q plots, or validation for this normality (for either of the two evaluated watermarking methods) are referenced, yet this property is required for the second-stage steering to produce the claimed detection rates below 10% without degrading first-stage attack success.
  2. [Abstract] Abstract: The reported detection rates (below 10% for replacement/creation, 0% for removal) rest on the accuracy of parameter estimation from 'limited samples'; the manuscript supplies no sample sizes, variance of the estimates, or ablation on how estimation error affects evasion, which directly bears on whether the low detection rates are reproducible or an artifact of the specific datasets.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below.

read point-by-point responses
  1. Referee: [Abstract] Abstract: The evasion strategy depends on the assumption that watermark decoder message probabilities follow normal distributions estimable from limited target samples; no statistical tests, Q-Q plots, or validation for this normality (for either of the two evaluated watermarking methods) are referenced, yet this property is required for the second-stage steering to produce the claimed detection rates below 10% without degrading first-stage attack success.

    Authors: We acknowledge that the current manuscript does not include explicit statistical validation (e.g., Q-Q plots or formal normality tests) for the observed normal distribution of decoder probabilities. In the revised version we will add Q-Q plots and Shapiro-Wilk test results for both watermarking methods across the evaluated datasets to substantiate the assumption. revision: yes

  2. Referee: [Abstract] Abstract: The reported detection rates (below 10% for replacement/creation, 0% for removal) rest on the accuracy of parameter estimation from 'limited samples'; the manuscript supplies no sample sizes, variance of the estimates, or ablation on how estimation error affects evasion, which directly bears on whether the low detection rates are reproducible or an artifact of the specific datasets.

    Authors: We agree that details on sample sizes for distribution estimation, variance of the parameter estimates, and sensitivity analysis are missing. The revised manuscript will report the exact number of samples used per dataset, include variance/confidence intervals for the estimated parameters, and add an ablation study examining how estimation error propagates to final detection rates. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical attack method with independent evaluation

full rationale

The paper presents an adaptive attack (AWM) that observes normality in decoder probabilities, estimates parameters from limited target samples, and steers outputs to evade detectors. Success is reported via measured detection rates (<10% replacement/creation, 0% removal) on evaluated datasets and watermarking methods. No equations, derivations, or self-citations reduce these outcomes to fitted inputs by construction; the estimation step is a practical heuristic whose effectiveness is tested externally rather than assumed tautologically. The central claim remains falsifiable through the reported experiments.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the domain assumption that decoder probabilities are normally distributed and can be reliably estimated from limited samples; no free parameters or invented entities are visible in the abstract.

axioms (1)
  • domain assumption Watermark decoder message probabilities follow normal distributions that defenses exploit for manipulation detection
    Stated as a finding used to design the evasion strategy in the abstract.

pith-pipeline@v0.9.1-grok · 5688 in / 1175 out tokens · 27516 ms · 2026-06-26T10:08:08.212176+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

49 extracted references · 13 canonical work pages · 4 internal anchors

  1. [1]

    Learning to Evade: Adaptive Attacks on Audio Watermarking

    Introduction In recent years, the rapid growth of social networking plat- forms has encouraged many users to publicly share their audio content, including original works such as audiobooks and self- produced music. These audio contents might bring them in- come. However, many unauthorized users copy creative works, modify them, and re-upload them to mains...

  2. [2]

    These schemes follow the architecture of the Encoder- Distortion-Decoder

    Related Work Deep Learning-Based Audio Watermarking.Unlike tradi- tional schemes [13], which rely on predefined transformations, deep-learning-based schemes can learn complex feature repre- sentations and optimize watermarking dynamically [4, 5, 14, 15, 16]. These schemes follow the architecture of the Encoder- Distortion-Decoder. The encoder embeds the w...

  3. [3]

    Preliminary Adding perturbations to the audio is a common strategy for at- tacks against watermarking systems

    Background 3.1. Preliminary Adding perturbations to the audio is a common strategy for at- tacks against watermarking systems. The core idea is to either destroy the original watermark or forge a new one by introduc- ing perturbations that deceive the watermark decoder. Audio Watermark Decoder .The audio watermark decoder Dec(·)takes the encoded audio as ...

  4. [4]

    The defenders are the attack detectors, who identify whether audio has been tampered with

    they are aware that the decoded message probabilities output by the watermark decoder follow a normal distribution, but they do not know the corresponding mean and standard deviation. The defenders are the attack detectors, who identify whether audio has been tampered with. We assume that: 1) defenders have access to a large number of ground-truth audio s...

  5. [5]

    attacked

    Methodology In this section, we first introduce a detection method based on outlier detection to identify if the given audio sample has been attacked. Second, we design our adaptive attack, which aims to achieve a successful attack while preserving perceptual quality. Meanwhile, we describe the adaptive optimization process for the three attack types. Det...

  6. [6]

    For the indices not included inmsg dif f, we assign the original watermark message probabilitiesp w to the corresponding target watermark message probabilitiesp t

    In this case, msgdif f = [1,3]. For the indices not included inmsg dif f, we assign the original watermark message probabilitiesp w to the corresponding target watermark message probabilitiesp t. That is, for indices[0,2,4,5], thep t is equal to thep w. This optimization has two advantages: (1) It directs the gra- dient to focus more on the indices where ...

  7. [7]

    Watermark

    Evaluation 5.1. Experimental Setup Datasets.We use three public datasets for our experiments. The first dataset is the LibriSpeech [36]. We select the small-sized subset, which has 6.3G audios, and covers 100.6 hours of audio data spoken by 251 speakers. The second dataset is obtained from AudioMarkData [8], which is built based on the Common V oice datas...

  8. [8]

    We show that defenders can leverage this statis- tical regularity to distinguish benign from attacked signals

    Conclusion In this work, we analyze watermark decoder outputs and ob- serve that they exhibit approximately normal distributions on benign audio. We show that defenders can leverage this statis- tical regularity to distinguish benign from attacked signals. To demonstrate the fragility of such defenses, we introduce AWM, an adaptive audio watermark attack....

  9. [9]

    National Science Foundation under Grant CNS-2310207, CNS-2520900, CNS- 2451168

    Acknowledgments This work was supported in part by the U.S. National Science Foundation under Grant CNS-2310207, CNS-2520900, CNS- 2451168

  10. [10]

    All tech- nical contents were conceived, conducted, and verified by the authors, which include the methodology, experiments, analysis, etc

    Generative AI Use Disclosure The authors used generative AI tools such as ChatGPT solely for grammar improvement and language polishing. All tech- nical contents were conceived, conducted, and verified by the authors, which include the methodology, experiments, analysis, etc. The authors take full responsibility for the content of the publication

  11. [11]

    (2023, May) Beware the artificial impostor

    McAfee. (2023, May) Beware the artificial impostor. a mcafee cybersecurity artificial intelligence report. [On- line]. Available: https://www.mcafee.com/content/dam/ consumer/en-us/resources/cybersecurity/artificial-intelligence/ rp-beware-the-artificial-impostor-report.pdf

  12. [12]

    Copyright protection in generative ai: A technical perspective,

    J. Ren, H. Xu, P. He, Y . Cui, S. Zeng, J. Zhang, H. Wen, J. Ding, P. Huang, L. Lyuet al., “Copyright protection in generative ai: A technical perspective,”arXiv preprint arXiv:2402.02333, 2024

  13. [13]

    Dear: A deep-learning-based audio re-recording resilient watermarking,

    C. Liu, J. Zhang, H. Fang, Z. Ma, W. Zhang, and N. Yu, “Dear: A deep-learning-based audio re-recording resilient watermarking,” inProceedings of the AAAI Conference on Artificial Intelligence, vol. 37, no. 11, 2023, pp. 13 201–13 209

  14. [14]

    De- tecting voice cloning attacks via timbre watermarking,

    C. Liu, J. Zhang, T. Zhang, X. Yang, W. Zhang, and N. Yu, “De- tecting voice cloning attacks via timbre watermarking,” inNet- work and Distributed System Security Symposium, 2024

  15. [15]

    Proactive detection of voice cloning with localized watermarking,

    R. San Roman, P. Fernandez, H. Elsahar, A. D´efossez, T. Furon, and T. Tran, “Proactive detection of voice cloning with localized watermarking,”ICML, 2024

  16. [16]

    Wm- codec: End-to-end neural speech codec with deep watermarking for authenticity verification,

    J. Zhou, J. Yi, Y . Ren, J. Tao, T. Wang, and C. Y . Zhang, “Wm- codec: End-to-end neural speech codec with deep watermarking for authenticity verification,” inICASSP 2025-2025 IEEE Inter- national Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2025, pp. 1–5

  17. [17]

    Sok: How robust is audio watermarking in generative ai models?

    Y . Wen, A. Innuganti, A. B. Ramos, H. Guo, and Q. Yan, “Sok: How robust is audio watermarking in generative ai models?” arXiv preprint arXiv:2503.19176, 2025

  18. [18]

    Audiomark- bench: Benchmarking robustness of audio watermarking,

    H. Liu, M. Guo, Z. Jiang, L. Wang, and N. Gong, “Audiomark- bench: Benchmarking robustness of audio watermarking,”Ad- vances in Neural Information Processing Systems, vol. 37, pp. 52 241–52 265, 2024

  19. [19]

    Can simple averag- ing defeat modern watermarks?

    P. Yang, H. Ci, Y . Song, and M. Z. Shou, “Can simple averag- ing defeat modern watermarks?”Advances in Neural Information Processing Systems, vol. 37, pp. 56 644–56 673, 2024

  20. [20]

    Sok: How robust is image classification deep neural network watermarking?

    N. Lukas, E. Jiang, X. Li, and F. Kerschbaum, “Sok: How robust is image classification deep neural network watermarking?” in 2022 IEEE Symposium on Security and Privacy (SP). IEEE, 2022, pp. 787–804

  21. [21]

    {ASSET}: Ro- bust backdoor data detection across a multiplicity of deep learn- ing paradigms,

    M. Pan, Y . Zeng, L. Lyu, X. Lin, and R. Jia, “{ASSET}: Ro- bust backdoor data detection across a multiplicity of deep learn- ing paradigms,” in32nd USENIX Security Symposium (USENIX Security 23), 2023, pp. 2725–2742

  22. [22]

    Find- ing needles in a haystack: A black-box approach to invisible wa- termark detection,

    M. Pan, Z. Wang, X. Dong, V . Sehwag, L. Lyu, and X. Lin, “Find- ing needles in a haystack: A black-box approach to invisible wa- termark detection,” inEuropean Conference on Computer Vision. Springer, 2024, pp. 253–270

  23. [23]

    Twenty years of digital audio watermarking—a comprehensive review,

    G. Hua, J. Huang, Y . Q. Shi, J. Goh, and V . L. Thing, “Twenty years of digital audio watermarking—a comprehensive review,” Signal processing, vol. 128, pp. 222–242, 2016

  24. [24]

    Wav- mark: Watermarking for audio generation,

    G. Chen, Y . Wu, S. Liu, T. Liu, X. Du, and F. Wei, “Wav- mark: Watermarking for audio generation,”arXiv preprint arXiv:2308.12770, 2023

  25. [25]

    Audio codec augmentation for robust col- laborative watermarking of speech synthesis,

    L. Juvela and X. Wang, “Audio codec augmentation for robust col- laborative watermarking of speech synthesis,” inICASSP 2025- 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2025, pp. 1–5

  26. [26]

    Silent- cipher: Deep audio watermarking

    M. K. Singh, N. Takahashi, W.-H. Liao, and Y . Mitsufuji, “Silent- cipher: Deep audio watermarking.” inINTERSPEECH, 2024

  27. [27]

    High Fidelity Neural Audio Compression

    A. D ´efossez, J. Copet, G. Synnaeve, and Y . Adi, “High fidelity neural audio compression,”arXiv preprint arXiv:2210.13438, 2022

  28. [28]

    One-shot voice conversion by separating speaker and content representations with instance normalization,

    J.-c. Chou, C.-c. Yeh, and H.-y. Lee, “One-shot voice conversion by separating speaker and content representations with instance normalization,”arXiv preprint arXiv:1904.05742, 2019

  29. [29]

    Fragmentvc: Any-to-any voice conversion by end-to-end extract- ing and fusing fine-grained voice fragments with attention,

    Y . Y . Lin, C.-M. Chien, J.-H. Lin, H.-y. Lee, and L.-s. Lee, “Fragmentvc: Any-to-any voice conversion by end-to-end extract- ing and fusing fine-grained voice fragments with attention,” in ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2021, pp. 5939– 5943

  30. [30]

    Natural tts synthesis by conditioning wavenet on mel spectrogram pre- dictions,

    J. Shen, R. Pang, R. J. Weiss, M. Schuster, N. Jaitly, Z. Yang, Z. Chen, Y . Zhang, Y . Wang, R. Skerrv-Ryanet al., “Natural tts synthesis by conditioning wavenet on mel spectrogram pre- dictions,” in2018 IEEE international conference on acoustics, speech and signal processing (ICASSP). IEEE, 2018, pp. 4779– 4783

  31. [31]

    Fastspeech 2: Fast and high-quality end-to-end text to speech,

    Y . Ren, C. Hu, X. Tan, T. Qin, S. Zhao, Z. Zhao, and T.-Y . Liu, “Fastspeech 2: Fast and high-quality end-to-end text to speech,” arXiv preprint arXiv:2006.04558, 2020

  32. [32]

    Hopskipjumpattack: A query-efficient decision-based attack,

    J. Chen, M. I. Jordan, and M. J. Wainwright, “Hopskipjumpattack: A query-efficient decision-based attack,” in2020 ieee symposium on security and privacy (sp). IEEE, 2020, pp. 1277–1294

  33. [33]

    Square attack: a query-efficient black-box adversarial attack via random search,

    M. Andriushchenko, F. Croce, N. Flammarion, and M. Hein, “Square attack: a query-efficient black-box adversarial attack via random search,” inEuropean conference on computer vision. Springer, 2020, pp. 484–501

  34. [34]

    Towards evaluating the robustness of neural networks,

    N. Carlini and D. Wagner, “Towards evaluating the robustness of neural networks,” in2017 ieee symposium on security and privacy (sp). Ieee, 2017, pp. 39–57

  35. [35]

    Towards Deep Learning Models Resistant to Adversarial Attacks

    A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu, “Towards deep learning models resistant to adversarial attacks,” arXiv preprint arXiv:1706.06083, 2017

  36. [36]

    Evading watermark based de- tection of ai-generated content,

    Z. Jiang, J. Zhang, and N. Z. Gong, “Evading watermark based de- tection of ai-generated content,” inProceedings of the 2023 ACM SIGSAC Conference on Computer and Communications Security, 2023, pp. 1168–1181

  37. [37]

    Clap learning audio concepts from natural language supervision,

    B. Elizalde, S. Deshmukh, M. Al Ismail, and H. Wang, “Clap learning audio concepts from natural language supervision,” in ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2023, pp. 1–5

  38. [38]

    Robust speech recognition via large-scale weak supervision,

    A. Radford, J. W. Kim, T. Xu, G. Brockman, C. McLeavey, and I. Sutskever, “Robust speech recognition via large-scale weak supervision,” inInternational conference on machine learning. PMLR, 2023, pp. 28 492–28 518

  39. [39]

    A re- view on outlier/anomaly detection in time series data,

    A. Bl ´azquez-Garc´ıa, A. Conde, U. Mori, and J. A. Lozano, “A re- view on outlier/anomaly detection in time series data,”ACM com- puting surveys (CSUR), vol. 54, no. 3, pp. 1–33, 2021

  40. [40]

    TimesNet: Temporal 2D-Variation Modeling for General Time Series Analysis

    H. Wu, T. Hu, Y . Liu, H. Zhou, J. Wang, and M. Long, “Timesnet: Temporal 2d-variation modeling for general time series analysis,” arXiv preprint arXiv:2210.02186, 2022

  41. [41]

    Dire for diffusion-generated image detection,

    Z. Wang, J. Bao, W. Zhou, W. Wang, H. Hu, H. Chen, and H. Li, “Dire for diffusion-generated image detection,” inProceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 22 445–22 455

  42. [42]

    Organic or diffused: Can we dis- tinguish human art from ai-generated images?

    A. Y . J. Ha, J. Passananti, R. Bhaskar, S. Shan, R. Southen, H. Zheng, and B. Y . Zhao, “Organic or diffused: Can we dis- tinguish human art from ai-generated images?” inProceedings of the 2024 on ACM SIGSAC Conference on Computer and Commu- nications Security, 2024, pp. 4822–4836

  43. [43]

    Detect- ing music deepfakes is easy but actually hard,

    D. Afchar, G. Meseguer-Brocal, and R. Hennequin, “Detect- ing music deepfakes is easy but actually hard,”arXiv preprint arXiv:2405.04181, 2024

  44. [44]

    Singfake: Singing voice deepfake detection,

    Y . Zang, Y . Zhang, M. Heydari, and Z. Duan, “Singfake: Singing voice deepfake detection,” inICASSP 2024-2024 IEEE Interna- tional Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2024, pp. 12 156–12 160

  45. [45]

    Conjugate bayesian analysis of the gaussian dis- tribution,

    K. P. Murphy, “Conjugate bayesian analysis of the gaussian dis- tribution,”def, vol. 1, no. 2σ2, p. 16, 2007

  46. [46]

    Lib- rispeech: an asr corpus based on public domain audio books,

    V . Panayotov, G. Chen, D. Povey, and S. Khudanpur, “Lib- rispeech: an asr corpus based on public domain audio books,” in2015 IEEE international conference on acoustics, speech and signal processing (ICASSP). IEEE, 2015, pp. 5206–5210

  47. [47]

    Common voice: A massively-multilingual speech corpus,

    R. Ardila, M. Branson, K. Davis, M. Henretty, M. Kohler, J. Meyer, R. Morais, L. Saunders, F. M. Tyers, and G. Weber, “Common voice: A massively-multilingual speech corpus,”arXiv preprint arXiv:1912.06670, 2019

  48. [48]

    Gigaspeech: An evolving, multi-domain asr corpus with 10,000 hours of transcribed audio,

    G. Chen, S. Chai, G. Wang, J. Du, W.-Q. Zhang, C. Weng, D. Su, D. Povey, J. Trmal, J. Zhanget al., “Gigaspeech: An evolving, multi-domain asr corpus with 10,000 hours of transcribed audio,” arXiv preprint arXiv:2106.06909, 2021

  49. [49]

    Visqol: an objective speech quality model,

    A. Hines, J. Skoglund, A. C. Kokaram, and N. Harte, “Visqol: an objective speech quality model,”EURASIP Journal on Audio, Speech, and Music Processing, vol. 2015, pp. 1–18, 2015