Learning to Evade: Adaptive Attacks on Audio Watermarking

Guangjing Wang; Hanqing Guo; Mingzhe Chen; Qiben Yan; Rui Duan; Weikang Ding; Yuanda Wang

arxiv: 2606.22310 · v1 · pith:USJTXSP2new · submitted 2026-06-21 · 💻 cs.SD · cs.CR

Weikang Ding, Hanqing Guo, Rui Duan, Guangjing Wang, Yuanda Wang, Mingzhe Chen, Qiben Yan

Pith reviewed 2026-06-26 10:08 UTC · model grok-4.3

classification 💻 cs.SD cs.CR

keywords: audio watermarking, adversarial attacks, adaptive attacks, watermark evasion, audio copyright, generative audio, detection bypass

The pith

An adaptive attack steers audio watermark decoder probabilities into estimated normal ranges to evade detection.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents AWM, a method that attacks audio watermarks by first optimizing for success in replacing, creating, or removing marks and then refining audio quality. It exploits the fact that decoder message probabilities follow normal distributions, which existing detectors use to spot manipulations. By estimating those distribution parameters from a few samples of the target audio, AWM steers the attack outputs back inside the expected range. This matters because generative audio tools heighten the need for reliable ownership marks, yet the approach drops detection rates below 10 percent for replacement and creation and to zero for removal across tested methods and datasets.

Core claim

Watermark decoder message probabilities follow normal distributions, a property that can be estimated from limited samples of the target audio. AWM uses a two-stage optimization where the first stage ensures attack success on the watermark and the second improves audio quality, while adaptively steering decoded probabilities into the estimated normal range to bypass detectors.

What carries the argument

Two-stage optimization that estimates normal distribution parameters from limited target audio samples and steers decoded message probabilities back into the estimated range.
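
A minimal sketch of the steering idea, assuming a handful of hypothetical decoder probabilities for a single message bit. The helper names `estimate_normal_range` and `steer_into_range`, the interval width `k`, and all sample values are illustrative assumptions. The paper describes steering as part of an optimization; this sketch replaces that with a plain clamp to show only the target range.

```python
import statistics

def estimate_normal_range(samples, k=2.0):
    """Fit mean and sample std to a few probabilities; return mu +/- k*sigma."""
    mu = statistics.fmean(samples)
    sigma = statistics.stdev(samples)  # sample std, needs at least 2 samples
    return mu - k * sigma, mu + k * sigma

def steer_into_range(probs, low, high):
    """Clamp each probability into [low, high] (a crude stand-in for optimization)."""
    return [min(max(p, low), high) for p in probs]

# Hypothetical decoder probabilities observed on a few reference clips.
reference = [0.90, 0.92, 0.88, 0.91, 0.89]
low, high = estimate_normal_range(reference, k=2.0)

# Hypothetical outputs of an unconstrained attack: one too confident, one too weak.
attacked = [0.999, 0.90, 0.75]
steered = steer_into_range(attacked, low, high)
print(f"range=({low:.4f}, {high:.4f}) steered={steered}")
```

Values already inside the estimated range pass through unchanged; only the outliers that a distribution-based detector would notice are pulled back.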

If this is right

AWM succeeds against two different watermarking methods on three voice datasets.
Detection rates fall below 10 percent for replacement and creation attacks.
Removal attacks achieve 0 percent detection rate.
The method maintains high attack success while producing usable audio output.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Watermark systems may need to adopt output distributions that resist quick estimation from small sample sets.
Detectors could add checks that do not rely solely on normality assumptions.
Real-world watermark deployment in generative audio tools may require ongoing adaptation against such steering attacks.

Load-bearing premise

Watermark decoder message probabilities follow normal distributions that can be estimated accurately from limited samples of the target audio.
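
For orientation, a defender-side check built on this premise might look like the sketch below: fit a normal distribution to decoder probabilities from benign audio, then flag a probability whose two-sided tail probability is small. The function names, the significance level `alpha`, and the sample values are assumptions for illustration, not the detector evaluated in the paper.

```python
from statistics import NormalDist, fmean, stdev

def fit_detector(benign_probs):
    """Fit a normal distribution to decoder probabilities from benign audio."""
    return NormalDist(fmean(benign_probs), stdev(benign_probs))

def is_flagged(prob, dist, alpha=0.05):
    """Flag a probability whose z-score is unlikely under the fitted normal."""
    z = (prob - dist.mean) / dist.stdev
    # Two-sided p-value under the standard normal distribution.
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return p_value < alpha

# Hypothetical benign probabilities for one message bit.
dist = fit_detector([0.90, 0.92, 0.88, 0.91, 0.89])
print(is_flagged(0.999, dist))  # far outside the fitted range
print(is_flagged(0.905, dist))  # close to the fitted mean
```

An attacker who can estimate the same mean and standard deviation can keep outputs inside the region this check accepts, which is why the premise is load-bearing for both sides.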

What would settle it

An experiment showing that the attack no longer keeps detection rates low when the probability outputs deviate from normality or when accurate parameter estimates require far more than a few samples.
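
One way to probe the sample-size half of that experiment is a small simulation, sketched here under the assumption that the probabilities really are normal. The true mean, true standard deviation, trial count, and the function name `mean_estimate_spread` are all hypothetical choices, not values from the paper.

```python
import random
from statistics import fmean, stdev

def mean_estimate_spread(n_samples, true_mu=0.9, true_sigma=0.02,
                         trials=2000, seed=0):
    """Std of the estimated mean across repeated draws of n_samples values."""
    rng = random.Random(seed)
    estimates = [
        fmean([rng.gauss(true_mu, true_sigma) for _ in range(n_samples)])
        for _ in range(trials)
    ]
    return stdev(estimates)

# The spread should shrink roughly like true_sigma / sqrt(n_samples).
spread_small = mean_estimate_spread(5)
spread_large = mean_estimate_spread(100)
print(f"n=5: {spread_small:.4f}  n=100: {spread_large:.4f}")
```

If the range a detector accepts is narrow compared with the spread at small sample counts, a few samples would not be enough for reliable steering; that comparison is what an ablation on sample size would report.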

Figures

Figures reproduced from arXiv: 2606.22310 by Guangjing Wang, Hanqing Guo, Mingzhe Chen, Qiben Yan, Rui Duan, Weikang Ding, Yuanda Wang.

**Figure 1.** Overview of the watermark attack (left) and the watermark attack detection process used to detect whether the audio has been tampered with (right).

**Figure 2.** In panel (a), the message distribution under attack differs in the range of message probabilities along the x-axis, even though it still exhibits a unimodal normal distribution.

**Figure 3.** Distribution of message probabilities under attacks (Timbre). We use the AudioMarkBench to perform the attacks.

**Figure 5.** The distribution estimation by the attacker. (a) Watermark replacement and creation: The attacker uses a small set of clean audio samples to generate watermarked audio samples, which are then used to estimate the distribution. (b) Watermark removal: The attacker directly uses the clean audio samples to estimate the distribution.

**Figure 6.** The design of AWM generator. The audio watermark attack step (left) ensures the success of the watermark attack, while the audio quality optimization step (right) focuses on improving audio quality.

**Figure 8.** Message probabilities distribution comparisons between AudioMarkBench and AWM for the watermark creation. Legend: AudioMarkBench, Ours, Ours (+opt).

**Figure 9.** The spectrograms of the watermark creation in AudioSeal. The dotted red box shows noticeable noise for AWM attack. The green box indicates that AWM (+opt) attack achieves higher audio quality and is visually similar to AudioMarkBench.

**Figure 10.** ASR after applying different no-box perturbations on the watermark creation attacks. It evaluates the ASR results of watermark creation against the no-box perturbations, which uses the Librispeech dataset for validation.

Abstract

Advances in generative audio have intensified copyright concerns, making audio watermarking increasingly important for asserting ownership. However, existing audio watermarking methods are vulnerable to adversarial attacks. We find that watermark decoder message probabilities follow normal distributions, a property exploited by defenses to detect manipulations. This paper introduces an adaptive audio watermark attack method (AWM) designed to bypass existing defense strategies. AWM uses a two-stage optimization: the first stage ensures attack success, while the second improves audio quality. To evade detection, it estimates normal distribution parameters from limited samples of the target audio, and then adaptively steers decoded probabilities back into the estimated range. Evaluated on two watermarking methods across three voice datasets, AWM achieves high success while bypassing state-of-the-art detectors: detection rates are below 10% for replacement and creation, and 0% for removal.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper's new piece is a two-stage attack that estimates normal-distribution parameters for the decoder probabilities from a few target samples and uses them to steer the attacked output past detectors, but the abstract gives no verification of normality and no sample-size details.

The main thing to know is that this work describes an adaptive attack (AWM) on audio watermarking that splits into two optimization stages. The first gets the replacement, creation, or removal to succeed; the second improves audio quality. To evade detection, the attack estimates the mean and variance of the watermark decoder's message probabilities from a small number of target audio samples and steers the attacked output back inside that estimated normal range so that existing detectors do not flag it.

They report results on two watermarking schemes and three voice datasets, with detection rates falling below 10 % for replacement and creation and to 0 % for removal. That is the concrete claim.

The soft spot is exactly the one the stress-test flags: everything depends on the probabilities actually being normal and on the limited-sample estimate being accurate enough that the steering does not break the first-stage success or get caught another way. The abstract supplies no plots, no Kolmogorov-Smirnov numbers, no ablation on how many samples are needed for the estimate, and no error bars on the detection rates. Without those, it is impossible to judge whether the evasion is robust or whether it only works under the conditions they happened to test.

If the full paper contains the distribution checks, the sample-size experiments, and released code, then the result is worth a referee's time for the audio-security community. The method is a direct response to the detection technique used in prior defenses, so the engagement with the literature looks honest. I would send it out for review but would ask the authors to add the missing verification steps before acceptance.

Referee Report

2 major / 0 minor

Summary. The manuscript presents an adaptive attack method called AWM for evading audio watermarking detectors. It exploits the observation that watermark decoder message probabilities follow normal distributions by estimating their parameters from limited target audio samples and steering the decoded probabilities into the estimated range during a two-stage optimization process. The first stage focuses on attack success, the second on audio quality. Evaluations on two watermarking methods and three voice datasets show high attack success with detection rates below 10% for replacement and creation attacks and 0% for removal attacks.

Significance. Should the central claims be substantiated, particularly the normality assumption and the empirical evasion results, the paper would make a notable contribution to the field of audio watermarking security by illustrating how adaptive attacks can circumvent existing detectors. This could prompt the development of more robust watermarking techniques that do not rely on distributional assumptions vulnerable to estimation from limited samples.

major comments (2)

[Abstract] The evasion strategy depends on the assumption that watermark decoder message probabilities follow normal distributions estimable from limited target samples; no statistical tests, Q-Q plots, or validation for this normality (for either of the two evaluated watermarking methods) are referenced, yet this property is required for the second-stage steering to produce the claimed detection rates below 10% without degrading first-stage attack success.
[Abstract] The reported detection rates (below 10% for replacement/creation, 0% for removal) rest on the accuracy of parameter estimation from 'limited samples'; the manuscript supplies no sample sizes, variance of the estimates, or ablation on how estimation error affects evasion, which directly bears on whether the low detection rates are reproducible or an artifact of the specific datasets.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below.

Referee: [Abstract] The evasion strategy depends on the assumption that watermark decoder message probabilities follow normal distributions estimable from limited target samples; no statistical tests, Q-Q plots, or validation for this normality (for either of the two evaluated watermarking methods) are referenced, yet this property is required for the second-stage steering to produce the claimed detection rates below 10% without degrading first-stage attack success.

Authors: We acknowledge that the current manuscript does not include explicit statistical validation (e.g., Q-Q plots or formal normality tests) for the observed normal distribution of decoder probabilities. In the revised version we will add Q-Q plots and Shapiro-Wilk test results for both watermarking methods across the evaluated datasets to substantiate the assumption. (revision: yes)

Referee: [Abstract] The reported detection rates (below 10% for replacement/creation, 0% for removal) rest on the accuracy of parameter estimation from 'limited samples'; the manuscript supplies no sample sizes, variance of the estimates, or ablation on how estimation error affects evasion, which directly bears on whether the low detection rates are reproducible or an artifact of the specific datasets.

Authors: We agree that details on sample sizes for distribution estimation, variance of the parameter estimates, and sensitivity analysis are missing. The revised manuscript will report the exact number of samples used per dataset, include variance/confidence intervals for the estimated parameters, and add an ablation study examining how estimation error propagates to final detection rates. (revision: yes)
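
A Q-Q style normality check of the kind discussed in this exchange can be approximated without plotting, by correlating sorted samples with standard-normal quantiles. The sketch below is a hedged illustration on synthetic data: the function name `qq_correlation`, the plotting positions, and both sample sets are assumptions, and it is not a substitute for a formal test such as Shapiro-Wilk.

```python
import random
from statistics import NormalDist, fmean

def qq_correlation(samples):
    """Pearson correlation between sorted samples and standard-normal quantiles."""
    xs = sorted(samples)
    n = len(xs)
    # Theoretical quantiles at the plotting positions (i + 0.5) / n.
    qs = [NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)]
    mx, mq = fmean(xs), fmean(qs)
    cov = sum((x - mx) * (q - mq) for x, q in zip(xs, qs))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_q = sum((q - mq) ** 2 for q in qs)
    return cov / (var_x * var_q) ** 0.5

rng = random.Random(0)
# Synthetic stand-ins for decoder probabilities; not data from the paper.
gaussian_like = [rng.gauss(0.9, 0.02) for _ in range(500)]
bimodal = ([rng.gauss(0.1, 0.01) for _ in range(250)]
           + [rng.gauss(0.9, 0.01) for _ in range(250)])

r_gauss = qq_correlation(gaussian_like)
r_bimodal = qq_correlation(bimodal)
print(f"gaussian-like r={r_gauss:.3f}  bimodal r={r_bimodal:.3f}")
```

A correlation close to 1 means the points of a Q-Q plot would fall near a straight line; a clearly lower value, as for the bimodal set, signals a departure from normality.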

Circularity Check

0 steps flagged

No circularity: empirical attack method with independent evaluation

The paper presents an adaptive attack (AWM) that observes normality in decoder probabilities, estimates parameters from limited target samples, and steers outputs to evade detectors. Success is reported via measured detection rates (<10% replacement/creation, 0% removal) on evaluated datasets and watermarking methods. No equations, derivations, or self-citations reduce these outcomes to fitted inputs by construction; the estimation step is a practical heuristic whose effectiveness is tested externally rather than assumed tautologically. The central claim remains falsifiable through the reported experiments.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that decoder probabilities are normally distributed and can be reliably estimated from limited samples; no free parameters or invented entities are visible in the abstract.

axioms (1)

domain assumption: Watermark decoder message probabilities follow normal distributions that defenses exploit for manipulation detection.
Stated as a finding used to design the evasion strategy in the abstract.

Reference graph

Works this paper leans on

49 extracted references · 13 canonical work pages · 4 internal anchors

[1] Learning to Evade: Adaptive Attacks on Audio Watermarking. Introduction. In recent years, the rapid growth of social networking platforms has encouraged many users to publicly share their audio content, including original works such as audiobooks and self-produced music. These audio contents might bring them income. However, many unauthorized users copy creative works, modify them, and re-upload them to mains...
[2] Related Work. Deep Learning-Based Audio Watermarking. Unlike traditional schemes [13], which rely on predefined transformations, deep-learning-based schemes can learn complex feature representations and optimize watermarking dynamically [4, 5, 14, 15, 16]. These schemes follow the architecture of the Encoder-Distortion-Decoder. The encoder embeds the w...
[3] Background. 3.1. Preliminary. Adding perturbations to the audio is a common strategy for attacks against watermarking systems. The core idea is to either destroy the original watermark or forge a new one by introducing perturbations that deceive the watermark decoder. Audio Watermark Decoder. The audio watermark decoder Dec(·) takes the encoded audio as ...
[4] they are aware that the decoded message probabilities output by the watermark decoder follow a normal distribution, but they do not know the corresponding mean and standard deviation. The defenders are the attack detectors, who identify whether audio has been tampered with. We assume that: 1) defenders have access to a large number of ground-truth audio s...
[5] Methodology. In this section, we first introduce a detection method based on outlier detection to identify if the given audio sample has been attacked. Second, we design our adaptive attack, which aims to achieve a successful attack while preserving perceptual quality. Meanwhile, we describe the adaptive optimization process for the three attack types. Det...
[6] In this case, msg_diff = [1, 3]. For the indices not included in msg_diff, we assign the original watermark message probabilities p_w to the corresponding target watermark message probabilities p_t. That is, for indices [0, 2, 4, 5], the p_t is equal to the p_w. This optimization has two advantages: (1) It directs the gradient to focus more on the indices where ...
[7] Evaluation. 5.1. Experimental Setup. Datasets. We use three public datasets for our experiments. The first dataset is the LibriSpeech [36]. We select the small-sized subset, which has 6.3G audios, and covers 100.6 hours of audio data spoken by 251 speakers. The second dataset is obtained from AudioMarkData [8], which is built based on the Common Voice datas...
[8] Conclusion. In this work, we analyze watermark decoder outputs and observe that they exhibit approximately normal distributions on benign audio. We show that defenders can leverage this statistical regularity to distinguish benign from attacked signals. To demonstrate the fragility of such defenses, we introduce AWM, an adaptive audio watermark attack....
[9] Acknowledgments. This work was supported in part by the U.S. National Science Foundation under Grant CNS-2310207, CNS-2520900, CNS-2451168.
[10] Generative AI Use Disclosure. The authors used generative AI tools such as ChatGPT solely for grammar improvement and language polishing. All technical contents were conceived, conducted, and verified by the authors, which include the methodology, experiments, analysis, etc. The authors take full responsibility for the content of the publication.
[11] McAfee. (2023, May) Beware the artificial impostor. A McAfee cybersecurity artificial intelligence report. [Online]. Available: https://www.mcafee.com/content/dam/consumer/en-us/resources/cybersecurity/artificial-intelligence/rp-beware-the-artificial-impostor-report.pdf
[12] J. Ren, H. Xu, P. He, Y. Cui, S. Zeng, J. Zhang, H. Wen, J. Ding, P. Huang, L. Lyu et al., “Copyright protection in generative AI: A technical perspective,” arXiv preprint arXiv:2402.02333, 2024.
[13] C. Liu, J. Zhang, H. Fang, Z. Ma, W. Zhang, and N. Yu, “DeAR: A deep-learning-based audio re-recording resilient watermarking,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 37, no. 11, 2023, pp. 13 201–13 209.
[14] C. Liu, J. Zhang, T. Zhang, X. Yang, W. Zhang, and N. Yu, “Detecting voice cloning attacks via timbre watermarking,” in Network and Distributed System Security Symposium, 2024.
[15] R. San Roman, P. Fernandez, H. Elsahar, A. Défossez, T. Furon, and T. Tran, “Proactive detection of voice cloning with localized watermarking,” ICML, 2024.
[16] J. Zhou, J. Yi, Y. Ren, J. Tao, T. Wang, and C. Y. Zhang, “WMCodec: End-to-end neural speech codec with deep watermarking for authenticity verification,” in ICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2025, pp. 1–5.
[17] Y. Wen, A. Innuganti, A. B. Ramos, H. Guo, and Q. Yan, “SoK: How robust is audio watermarking in generative AI models?” arXiv preprint arXiv:2503.19176, 2025.
[18] H. Liu, M. Guo, Z. Jiang, L. Wang, and N. Gong, “AudioMarkBench: Benchmarking robustness of audio watermarking,” Advances in Neural Information Processing Systems, vol. 37, pp. 52 241–52 265, 2024.
[19] P. Yang, H. Ci, Y. Song, and M. Z. Shou, “Can simple averaging defeat modern watermarks?” Advances in Neural Information Processing Systems, vol. 37, pp. 56 644–56 673, 2024.
[20] N. Lukas, E. Jiang, X. Li, and F. Kerschbaum, “SoK: How robust is image classification deep neural network watermarking?” in 2022 IEEE Symposium on Security and Privacy (SP). IEEE, 2022, pp. 787–804.
[21] M. Pan, Y. Zeng, L. Lyu, X. Lin, and R. Jia, “ASSET: Robust backdoor data detection across a multiplicity of deep learning paradigms,” in 32nd USENIX Security Symposium (USENIX Security 23), 2023, pp. 2725–2742.
[22] M. Pan, Z. Wang, X. Dong, V. Sehwag, L. Lyu, and X. Lin, “Finding needles in a haystack: A black-box approach to invisible watermark detection,” in European Conference on Computer Vision. Springer, 2024, pp. 253–270.
[23] G. Hua, J. Huang, Y. Q. Shi, J. Goh, and V. L. Thing, “Twenty years of digital audio watermarking—a comprehensive review,” Signal Processing, vol. 128, pp. 222–242, 2016.
[24] G. Chen, Y. Wu, S. Liu, T. Liu, X. Du, and F. Wei, “WavMark: Watermarking for audio generation,” arXiv preprint arXiv:2308.12770, 2023.
[25] L. Juvela and X. Wang, “Audio codec augmentation for robust collaborative watermarking of speech synthesis,” in ICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2025, pp. 1–5.
[26] M. K. Singh, N. Takahashi, W.-H. Liao, and Y. Mitsufuji, “SilentCipher: Deep audio watermarking,” in INTERSPEECH, 2024.
[27] A. Défossez, J. Copet, G. Synnaeve, and Y. Adi, “High fidelity neural audio compression,” arXiv preprint arXiv:2210.13438, 2022.
[28] J.-c. Chou, C.-c. Yeh, and H.-y. Lee, “One-shot voice conversion by separating speaker and content representations with instance normalization,” arXiv preprint arXiv:1904.05742, 2019.
[29] Y. Y. Lin, C.-M. Chien, J.-H. Lin, H.-y. Lee, and L.-s. Lee, “FragmentVC: Any-to-any voice conversion by end-to-end extracting and fusing fine-grained voice fragments with attention,” in ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2021, pp. 5939–5943.
[30] J. Shen, R. Pang, R. J. Weiss, M. Schuster, N. Jaitly, Z. Yang, Z. Chen, Y. Zhang, Y. Wang, R. Skerry-Ryan et al., “Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 4779–4783.
[31] Y. Ren, C. Hu, X. Tan, T. Qin, S. Zhao, Z. Zhao, and T.-Y. Liu, “FastSpeech 2: Fast and high-quality end-to-end text to speech,” arXiv preprint arXiv:2006.04558, 2020.
[32] J. Chen, M. I. Jordan, and M. J. Wainwright, “HopSkipJumpAttack: A query-efficient decision-based attack,” in 2020 IEEE Symposium on Security and Privacy (SP). IEEE, 2020, pp. 1277–1294.
[33] M. Andriushchenko, F. Croce, N. Flammarion, and M. Hein, “Square attack: a query-efficient black-box adversarial attack via random search,” in European Conference on Computer Vision. Springer, 2020, pp. 484–501.
[34] N. Carlini and D. Wagner, “Towards evaluating the robustness of neural networks,” in 2017 IEEE Symposium on Security and Privacy (SP). IEEE, 2017, pp. 39–57.
[35] A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu, “Towards deep learning models resistant to adversarial attacks,” arXiv preprint arXiv:1706.06083, 2017.
[36] Z. Jiang, J. Zhang, and N. Z. Gong, “Evading watermark based detection of AI-generated content,” in Proceedings of the 2023 ACM SIGSAC Conference on Computer and Communications Security, 2023, pp. 1168–1181.
[37] B. Elizalde, S. Deshmukh, M. Al Ismail, and H. Wang, “CLAP learning audio concepts from natural language supervision,” in ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2023, pp. 1–5.
[38] A. Radford, J. W. Kim, T. Xu, G. Brockman, C. McLeavey, and I. Sutskever, “Robust speech recognition via large-scale weak supervision,” in International Conference on Machine Learning. PMLR, 2023, pp. 28 492–28 518.
[39] A. Blázquez-García, A. Conde, U. Mori, and J. A. Lozano, “A review on outlier/anomaly detection in time series data,” ACM Computing Surveys (CSUR), vol. 54, no. 3, pp. 1–33, 2021.
[40] H. Wu, T. Hu, Y. Liu, H. Zhou, J. Wang, and M. Long, “TimesNet: Temporal 2D-variation modeling for general time series analysis,” arXiv preprint arXiv:2210.02186, 2022.
[41] Z. Wang, J. Bao, W. Zhou, W. Wang, H. Hu, H. Chen, and H. Li, “DIRE for diffusion-generated image detection,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 22 445–22 455.
[42] A. Y. J. Ha, J. Passananti, R. Bhaskar, S. Shan, R. Southen, H. Zheng, and B. Y. Zhao, “Organic or diffused: Can we distinguish human art from AI-generated images?” in Proceedings of the 2024 on ACM SIGSAC Conference on Computer and Communications Security, 2024, pp. 4822–4836.
[43] D. Afchar, G. Meseguer-Brocal, and R. Hennequin, “Detecting music deepfakes is easy but actually hard,” arXiv preprint arXiv:2405.04181, 2024.
[44] Y. Zang, Y. Zhang, M. Heydari, and Z. Duan, “SingFake: Singing voice deepfake detection,” in ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2024, pp. 12 156–12 160.
[45] K. P. Murphy, “Conjugate Bayesian analysis of the Gaussian distribution,” def, vol. 1, no. 2σ2, p. 16, 2007.
[46] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, “LibriSpeech: an ASR corpus based on public domain audio books,” in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2015, pp. 5206–5210.
[47] R. Ardila, M. Branson, K. Davis, M. Henretty, M. Kohler, J. Meyer, R. Morais, L. Saunders, F. M. Tyers, and G. Weber, “Common Voice: A massively-multilingual speech corpus,” arXiv preprint arXiv:1912.06670, 2019.
[48] G. Chen, S. Chai, G. Wang, J. Du, W.-Q. Zhang, C. Weng, D. Su, D. Povey, J. Trmal, J. Zhang et al., “GigaSpeech: An evolving, multi-domain ASR corpus with 10,000 hours of transcribed audio,” arXiv preprint arXiv:2106.06909, 2021.
[49] A. Hines, J. Skoglund, A. C. Kokaram, and N. Harte, “ViSQOL: an objective speech quality model,” EURASIP Journal on Audio, Speech, and Music Processing, vol. 2015, pp. 1–18, 2015.

[1] [1]

Learning to Evade: Adaptive Attacks on Audio Watermarking

Introduction In recent years, the rapid growth of social networking plat- forms has encouraged many users to publicly share their audio content, including original works such as audiobooks and self- produced music. These audio contents might bring them in- come. However, many unauthorized users copy creative works, modify them, and re-upload them to mains...

work page internal anchor Pith review Pith/arXiv arXiv 2026

[2] [2]

These schemes follow the architecture of the Encoder- Distortion-Decoder

Related Work Deep Learning-Based Audio Watermarking.Unlike tradi- tional schemes [13], which rely on predefined transformations, deep-learning-based schemes can learn complex feature repre- sentations and optimize watermarking dynamically [4, 5, 14, 15, 16]. These schemes follow the architecture of the Encoder- Distortion-Decoder. The encoder embeds the w...

[3] [3]

Preliminary Adding perturbations to the audio is a common strategy for at- tacks against watermarking systems

Background 3.1. Preliminary Adding perturbations to the audio is a common strategy for at- tacks against watermarking systems. The core idea is to either destroy the original watermark or forge a new one by introduc- ing perturbations that deceive the watermark decoder. Audio Watermark Decoder .The audio watermark decoder Dec(·)takes the encoded audio as ...

[4]

The defenders are the attack detectors, who identify whether audio has been tampered with

they are aware that the decoded message probabilities output by the watermark decoder follow a normal distribution, but they do not know the corresponding mean and standard deviation. The defenders are the attack detectors, who identify whether audio has been tampered with. We assume that: 1) defenders have access to a large number of ground-truth audio s...

[5]

attacked

Methodology In this section, we first introduce a detection method based on outlier detection to identify if the given audio sample has been attacked. Second, we design our adaptive attack, which aims to achieve a successful attack while preserving perceptual quality. Meanwhile, we describe the adaptive optimization process for the three attack types. Det...
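
The outlier-based detection idea in the Methodology excerpt above can be sketched as follows. This is only an illustrative sketch, not the authors' detector: it assumes the defender pools decoded probabilities from benign audio into a single normal distribution and flags a sample when any probability falls outside the mean plus or minus k standard deviations. The function names, the threshold k = 3, and the sample values are all hypothetical.

```python
from statistics import mean, stdev


def fit_normal(benign_probs):
    """Estimate mean and sample standard deviation from benign decoder probabilities."""
    return mean(benign_probs), stdev(benign_probs)


def is_attacked(probs, mu, sigma, k=3.0):
    """Flag the sample if any decoded probability lies outside mu +/- k * sigma."""
    return any(abs(p - mu) > k * sigma for p in probs)


# Hypothetical decoded probabilities collected from benign watermarked audio.
benign = [0.91, 0.93, 0.90, 0.94, 0.92, 0.89, 0.95, 0.92]
mu, sigma = fit_normal(benign)

# A sample whose probabilities stay inside the estimated band is accepted.
print(is_attacked([0.92, 0.90, 0.93], mu, sigma))   # False
# A sample pushed too far by an attack falls outside the band and is flagged.
print(is_attacked([0.92, 0.999, 0.93], mu, sigma))  # True
```

Under the paper's threat model, an adaptive attacker would estimate the same two parameters from a few samples and keep the attacked probabilities inside that band, which is what makes a detector of this kind fragile.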

[6]

For the indices not included in msg_diff, we assign the original watermark message probabilities p_w to the corresponding target watermark message probabilities p_t

In this case, msg_diff = [1,3]. For the indices not included in msg_diff, we assign the original watermark message probabilities p_w to the corresponding target watermark message probabilities p_t. That is, for indices [0,2,4,5], p_t is equal to p_w. This optimization has two advantages: (1) It directs the gradient to focus more on the indices where ...
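
The index bookkeeping described in this excerpt can be illustrated with a short sketch. It assumes binary messages and a lookup of desired probabilities per target bit. The message values, the probabilities, and the names build_target_probs and fill are hypothetical, chosen only so that msg_diff comes out as [1, 3], matching the example in the text. How the paper sets p_t at the differing indices is not visible in this excerpt, so the fill values are an assumption.

```python
def build_target_probs(p_w, msg_w, msg_t, fill):
    """Return (msg_diff, p_t) for a watermark replacement target.

    p_w   : decoded probabilities of the original watermark
    msg_w : original watermark bits
    msg_t : target watermark bits
    fill  : hypothetical desired probability for each target bit value
    """
    # Indices where the target message differs from the original one.
    msg_diff = [i for i, (w, t) in enumerate(zip(msg_w, msg_t)) if w != t]
    # Indices outside msg_diff keep the original probabilities (p_t = p_w).
    p_t = list(p_w)
    # Only the differing indices receive a new target probability.
    for i in msg_diff:
        p_t[i] = fill[msg_t[i]]
    return msg_diff, p_t


msg_w = [1, 0, 1, 1, 0, 0]                  # hypothetical original message
msg_t = [1, 1, 1, 0, 0, 0]                  # hypothetical target message
p_w = [0.93, 0.08, 0.91, 0.94, 0.07, 0.06]  # hypothetical decoder output
fill = {0: 0.07, 1: 0.92}                   # assumed values, e.g. estimated means

msg_diff, p_t = build_target_probs(p_w, msg_w, msg_t, fill)
print(msg_diff)  # [1, 3]
print(p_t)       # [0.93, 0.92, 0.91, 0.07, 0.07, 0.06]
```

Because p_t equals p_w at indices 0, 2, 4, and 5, a loss that compares decoded probabilities against p_t has nothing to correct at those positions, which is consistent with the excerpt's remark that the gradient concentrates on the differing indices.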

[7]

Watermark

Evaluation 5.1. Experimental Setup Datasets. We use three public datasets for our experiments. The first dataset is LibriSpeech [36]. We select the small-sized subset, which contains 6.3 GB of audio and covers 100.6 hours of speech from 251 speakers. The second dataset is obtained from AudioMarkData [8], which is built based on the Common Voice datas...


[8]

We show that defenders can leverage this statistical regularity to distinguish benign from attacked signals

Conclusion In this work, we analyze watermark decoder outputs and observe that they exhibit approximately normal distributions on benign audio. We show that defenders can leverage this statistical regularity to distinguish benign from attacked signals. To demonstrate the fragility of such defenses, we introduce AWM, an adaptive audio watermark attack....

[9]

National Science Foundation under Grants CNS-2310207, CNS-2520900, and CNS-2451168

Acknowledgments This work was supported in part by the U.S. National Science Foundation under Grants CNS-2310207, CNS-2520900, and CNS-2451168

[10]

All technical content, including the methodology, experiments, and analysis, was conceived, conducted, and verified by the authors

Generative AI Use Disclosure The authors used generative AI tools such as ChatGPT solely for grammar improvement and language polishing. All technical content, including the methodology, experiments, and analysis, was conceived, conducted, and verified by the authors. The authors take full responsibility for the content of the publication

[11]

(2023, May) Beware the artificial impostor

McAfee. (2023, May) Beware the artificial impostor. A McAfee cybersecurity artificial intelligence report. [Online]. Available: https://www.mcafee.com/content/dam/consumer/en-us/resources/cybersecurity/artificial-intelligence/rp-beware-the-artificial-impostor-report.pdf

2023

[12]

Copyright protection in generative ai: A technical perspective,

J. Ren, H. Xu, P. He, Y. Cui, S. Zeng, J. Zhang, H. Wen, J. Ding, P. Huang, L. Lyu et al., “Copyright protection in generative ai: A technical perspective,” arXiv preprint arXiv:2402.02333, 2024

2024

[13]

Dear: A deep-learning-based audio re-recording resilient watermarking,

C. Liu, J. Zhang, H. Fang, Z. Ma, W. Zhang, and N. Yu, “Dear: A deep-learning-based audio re-recording resilient watermarking,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 37, no. 11, 2023, pp. 13 201–13 209

2023

[14]

Detecting voice cloning attacks via timbre watermarking,

C. Liu, J. Zhang, T. Zhang, X. Yang, W. Zhang, and N. Yu, “Detecting voice cloning attacks via timbre watermarking,” in Network and Distributed System Security Symposium, 2024

2024

[15]

Proactive detection of voice cloning with localized watermarking,

R. San Roman, P. Fernandez, H. Elsahar, A. Défossez, T. Furon, and T. Tran, “Proactive detection of voice cloning with localized watermarking,” ICML, 2024

2024

[16]

Wmcodec: End-to-end neural speech codec with deep watermarking for authenticity verification,

J. Zhou, J. Yi, Y. Ren, J. Tao, T. Wang, and C. Y. Zhang, “Wmcodec: End-to-end neural speech codec with deep watermarking for authenticity verification,” in ICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2025, pp. 1–5

2025

[17]

Sok: How robust is audio watermarking in generative ai models?

Y. Wen, A. Innuganti, A. B. Ramos, H. Guo, and Q. Yan, “Sok: How robust is audio watermarking in generative ai models?” arXiv preprint arXiv:2503.19176, 2025

2025

[18]

Audiomarkbench: Benchmarking robustness of audio watermarking,

H. Liu, M. Guo, Z. Jiang, L. Wang, and N. Gong, “Audiomarkbench: Benchmarking robustness of audio watermarking,” Advances in Neural Information Processing Systems, vol. 37, pp. 52 241–52 265, 2024

2024

[19]

Can simple averaging defeat modern watermarks?

P. Yang, H. Ci, Y. Song, and M. Z. Shou, “Can simple averaging defeat modern watermarks?” Advances in Neural Information Processing Systems, vol. 37, pp. 56 644–56 673, 2024

2024

[20]

Sok: How robust is image classification deep neural network watermarking?

N. Lukas, E. Jiang, X. Li, and F. Kerschbaum, “Sok: How robust is image classification deep neural network watermarking?” in 2022 IEEE Symposium on Security and Privacy (SP). IEEE, 2022, pp. 787–804

2022

[21]

ASSET: Robust backdoor data detection across a multiplicity of deep learning paradigms,

M. Pan, Y. Zeng, L. Lyu, X. Lin, and R. Jia, “ASSET: Robust backdoor data detection across a multiplicity of deep learning paradigms,” in 32nd USENIX Security Symposium (USENIX Security 23), 2023, pp. 2725–2742

2023

[22]

Finding needles in a haystack: A black-box approach to invisible watermark detection,

M. Pan, Z. Wang, X. Dong, V. Sehwag, L. Lyu, and X. Lin, “Finding needles in a haystack: A black-box approach to invisible watermark detection,” in European Conference on Computer Vision. Springer, 2024, pp. 253–270

2024

[23]

Twenty years of digital audio watermarking—a comprehensive review,

G. Hua, J. Huang, Y. Q. Shi, J. Goh, and V. L. Thing, “Twenty years of digital audio watermarking—a comprehensive review,” Signal processing, vol. 128, pp. 222–242, 2016

2016

[24]

Wavmark: Watermarking for audio generation,

G. Chen, Y. Wu, S. Liu, T. Liu, X. Du, and F. Wei, “Wavmark: Watermarking for audio generation,” arXiv preprint arXiv:2308.12770, 2023

2023

[25]

Audio codec augmentation for robust collaborative watermarking of speech synthesis,

L. Juvela and X. Wang, “Audio codec augmentation for robust collaborative watermarking of speech synthesis,” in ICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2025, pp. 1–5

2025

[26]

Silentcipher: Deep audio watermarking

M. K. Singh, N. Takahashi, W.-H. Liao, and Y. Mitsufuji, “Silentcipher: Deep audio watermarking,” in INTERSPEECH, 2024

2024

[27]

High Fidelity Neural Audio Compression

A. Défossez, J. Copet, G. Synnaeve, and Y. Adi, “High fidelity neural audio compression,” arXiv preprint arXiv:2210.13438, 2022

2022

[28]

One-shot voice conversion by separating speaker and content representations with instance normalization,

J.-c. Chou, C.-c. Yeh, and H.-y. Lee, “One-shot voice conversion by separating speaker and content representations with instance normalization,” arXiv preprint arXiv:1904.05742, 2019

2019

[29]

Fragmentvc: Any-to-any voice conversion by end-to-end extracting and fusing fine-grained voice fragments with attention,

Y. Y. Lin, C.-M. Chien, J.-H. Lin, H.-y. Lee, and L.-s. Lee, “Fragmentvc: Any-to-any voice conversion by end-to-end extracting and fusing fine-grained voice fragments with attention,” in ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2021, pp. 5939–5943

2021

[30]

Natural tts synthesis by conditioning wavenet on mel spectrogram predictions,

J. Shen, R. Pang, R. J. Weiss, M. Schuster, N. Jaitly, Z. Yang, Z. Chen, Y. Zhang, Y. Wang, R. Skerry-Ryan et al., “Natural tts synthesis by conditioning wavenet on mel spectrogram predictions,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 4779–4783

2018

[31]

Fastspeech 2: Fast and high-quality end-to-end text to speech,

Y. Ren, C. Hu, X. Tan, T. Qin, S. Zhao, Z. Zhao, and T.-Y. Liu, “Fastspeech 2: Fast and high-quality end-to-end text to speech,” arXiv preprint arXiv:2006.04558, 2020

2020

[32]

Hopskipjumpattack: A query-efficient decision-based attack,

J. Chen, M. I. Jordan, and M. J. Wainwright, “Hopskipjumpattack: A query-efficient decision-based attack,” in 2020 IEEE Symposium on Security and Privacy (SP). IEEE, 2020, pp. 1277–1294

2020

[33]

Square attack: a query-efficient black-box adversarial attack via random search,

M. Andriushchenko, F. Croce, N. Flammarion, and M. Hein, “Square attack: a query-efficient black-box adversarial attack via random search,” in European Conference on Computer Vision. Springer, 2020, pp. 484–501

2020

[34]

Towards evaluating the robustness of neural networks,

N. Carlini and D. Wagner, “Towards evaluating the robustness of neural networks,” in 2017 IEEE Symposium on Security and Privacy (SP). IEEE, 2017, pp. 39–57

2017

[35]

Towards Deep Learning Models Resistant to Adversarial Attacks

A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu, “Towards deep learning models resistant to adversarial attacks,” arXiv preprint arXiv:1706.06083, 2017

2017

[36]

Evading watermark based detection of ai-generated content,

Z. Jiang, J. Zhang, and N. Z. Gong, “Evading watermark based detection of ai-generated content,” in Proceedings of the 2023 ACM SIGSAC Conference on Computer and Communications Security, 2023, pp. 1168–1181

2023

[37]

Clap learning audio concepts from natural language supervision,

B. Elizalde, S. Deshmukh, M. Al Ismail, and H. Wang, “Clap learning audio concepts from natural language supervision,” in ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2023, pp. 1–5

2023

[38]

Robust speech recognition via large-scale weak supervision,

A. Radford, J. W. Kim, T. Xu, G. Brockman, C. McLeavey, and I. Sutskever, “Robust speech recognition via large-scale weak supervision,” in International Conference on Machine Learning. PMLR, 2023, pp. 28 492–28 518

2023

[39]

A review on outlier/anomaly detection in time series data,

A. Blázquez-García, A. Conde, U. Mori, and J. A. Lozano, “A review on outlier/anomaly detection in time series data,” ACM Computing Surveys (CSUR), vol. 54, no. 3, pp. 1–33, 2021

2021

[40]

TimesNet: Temporal 2D-Variation Modeling for General Time Series Analysis

H. Wu, T. Hu, Y. Liu, H. Zhou, J. Wang, and M. Long, “Timesnet: Temporal 2d-variation modeling for general time series analysis,” arXiv preprint arXiv:2210.02186, 2022

2022

[41]

Dire for diffusion-generated image detection,

Z. Wang, J. Bao, W. Zhou, W. Wang, H. Hu, H. Chen, and H. Li, “Dire for diffusion-generated image detection,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 22 445–22 455

2023

[42]

Organic or diffused: Can we distinguish human art from ai-generated images?

A. Y. J. Ha, J. Passananti, R. Bhaskar, S. Shan, R. Southen, H. Zheng, and B. Y. Zhao, “Organic or diffused: Can we distinguish human art from ai-generated images?” in Proceedings of the 2024 on ACM SIGSAC Conference on Computer and Communications Security, 2024, pp. 4822–4836

2024

[43]

Detecting music deepfakes is easy but actually hard,

D. Afchar, G. Meseguer-Brocal, and R. Hennequin, “Detecting music deepfakes is easy but actually hard,” arXiv preprint arXiv:2405.04181, 2024

2024

[44]

Singfake: Singing voice deepfake detection,

Y. Zang, Y. Zhang, M. Heydari, and Z. Duan, “Singfake: Singing voice deepfake detection,” in ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2024, pp. 12 156–12 160

2024

[45]

Conjugate bayesian analysis of the gaussian distribution,

K. P. Murphy, “Conjugate bayesian analysis of the gaussian distribution,” def, vol. 1, no. 2σ2, p. 16, 2007

2007

[46]

Librispeech: an asr corpus based on public domain audio books,

V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, “Librispeech: an asr corpus based on public domain audio books,” in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2015, pp. 5206–5210

2015

[47]

Common voice: A massively-multilingual speech corpus,

R. Ardila, M. Branson, K. Davis, M. Henretty, M. Kohler, J. Meyer, R. Morais, L. Saunders, F. M. Tyers, and G. Weber, “Common voice: A massively-multilingual speech corpus,” arXiv preprint arXiv:1912.06670, 2019

2019

[48]

Gigaspeech: An evolving, multi-domain asr corpus with 10,000 hours of transcribed audio,

G. Chen, S. Chai, G. Wang, J. Du, W.-Q. Zhang, C. Weng, D. Su, D. Povey, J. Trmal, J. Zhang et al., “Gigaspeech: An evolving, multi-domain asr corpus with 10,000 hours of transcribed audio,” arXiv preprint arXiv:2106.06909, 2021

2021

[49]

Visqol: an objective speech quality model,

A. Hines, J. Skoglund, A. C. Kokaram, and N. Harte, “Visqol: an objective speech quality model,” EURASIP Journal on Audio, Speech, and Music Processing, vol. 2015, pp. 1–18, 2015

2015