pith. sign in

arxiv: 2605.30614 · v1 · pith:BIOJT5N2new · submitted 2026-05-28 · 💻 cs.CR · cs.SD

Audio Pirates: Black-box Audio Watermark Removal via Diffusion Priors

Pith reviewed 2026-06-29 06:17 UTC · model grok-4.3

classification 💻 cs.CR cs.SD
keywords audio watermarkingdiffusion modelsblack-box attackwatermark removaladversarial robustnessAI-generated audio
0
0 comments X

The pith

A black-box diffusion method removes inaudible audio watermarks by regenerating from intermediate noise without knowing the scheme.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that audio watermarks meant to detect misuse of AI-generated audio can be stripped away using DiffErase, a method that needs no information about how the watermark was added. It adds an intermediate amount of diffusion noise to the watermarked audio and then regenerates the signal with a pretrained denoising model, which removes the watermark while leaving the sound perceptually unchanged. A reader would care because this shows existing watermarking may fail to protect intellectual property or flag synthetic audio if attackers use similar techniques. The results point to the need for watermark designs that survive diffusion-based regeneration.

Core claim

Theoretical analysis and extensive experiments demonstrate that inaudible audio watermarks are highly vulnerable: across multiple audio domains, DiffErase consistently removes watermarks while preserving perceptual quality by perturbing watermarked audio to an intermediate diffusion noise level and regenerating it using a pretrained denoising model.

What carries the argument

DiffErase, the process of perturbing watermarked audio to an intermediate diffusion noise level and regenerating with a pretrained denoising model to suppress watermark signals in a black-box setting.

If this is right

  • Watermarks can be removed consistently while maintaining audio quality.
  • The attack succeeds without any knowledge of or access to the watermarking scheme.
  • The vulnerability holds across multiple audio domains.
  • Future audio watermarking designs must account for diffusion-based removal threats.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same regeneration approach might be tested on watermarking for images or video.
  • Watermark developers could add tests against diffusion model regeneration during design.
  • Watermarks that embed signals resistant to denoising steps might evade this type of attack.

Load-bearing premise

That perturbing watermarked audio to an intermediate diffusion noise level and regenerating with a pretrained denoising model will reliably suppress the watermark signal without access to the watermarking scheme or degrading perceptual quality.

What would settle it

An experiment in which DiffErase either leaves watermarks detectable at high rates or produces audio with clearly degraded perceptual quality would falsify the central claim.

Figures

Figures reproduced from arXiv: 2605.30614 by Aohan Li, Chenpei Huang, Hanqing Guo, Jiang Liu, Lingfeng Yao, Miao Pan, Tomoaki Ohtsuki, Xincong Zhong, Xuandong Zhao.

Figure 1
Figure 1. Figure 1: Overview of DiffErase. Watermarked audio is converted to a mel-spectrogram via STFT, perturbed to an intermediate noise level t ∗ via forward noising, and then denoised back to t = 0. A vocoder reconstructs the attacked audio, which evades watermark detection. The noise level t ∗ controls the trade-off between removal strength and perceptual fidelity. dio requires careful design choices, as audio admits mu… view at source ↗
Figure 2
Figure 2. Figure 2: Effect of noise level t ∗ for DIFFERASE-MEL. Trade-off between audio quality (ViSQOL, left axis) and watermark removal (FNR = 1 − TPR, right axis), evaluated on Perth. Left: Speech. Middle: Music. Right: Environment. variants suppress watermark detection to TPR = 0 across all tested watermarking schemes while preserving percep￾tual quality. At noise level t ∗ = 0.1, DIFFERASE-LATENT achieves SQUIM-MOS of 4… view at source ↗
Figure 3
Figure 3. Figure 3: Results are reported for DIFFERASE-MEL; results of DIFFERASE-LATENT are provided in Appendix C.2 [PITH_FULL_IMAGE:figures/full_fig_p007_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Spectrogram visualization. Top: original audio. Middle: watermarked audio. Bottom: after DIFFERASE-MEL (t ∗ = 0.1). Watermark patterns (middle row) are attenuated while acoustic content is preserved. residual watermark evidence, while larger values improve removal at the cost of additional smoothing [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Effect of noise level t ∗ for DIFFERASE-LATENT. Trade-off between audio quality (ViSQOL, left axis) and watermark removal (FNR = 1 − TPR, right axis), evaluated on Perth. Left: Speech. Middle: Music. Right: Environment [PITH_FULL_IMAGE:figures/full_fig_p016_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Mel-spectrogram visualization across different attack representations. DiffWave (c) over-smooths harmonic details. WavePuri￾fier (d) introduces audible artifacts. DiffErase (e) preserves the spectral structure while removing watermark perturbations. Watermarked (a) Watermarked DiffWave Watermarked (b) DiffWave WavePurifier Watermarked (c) WavePurifier DiffErase Watermarked (d) DiffErase [PITH_FULL_IMAGE:f… view at source ↗
Figure 7
Figure 7. Figure 7: Waveform visualization across different attack representations. Red waveforms show reconstructed audio, and gray shows input watermarked audio. DiffErase (d) closely matches the original temporal envelope. 17 [PITH_FULL_IMAGE:figures/full_fig_p017_7.png] view at source ↗
read the original abstract

With the rise of AI-generated audio, watermarking has become widely used for detecting misuse and protecting intellectual property. However, adversaries may try to remove these watermarks, making it critical to evaluate how well watermarking schemes withstand removal attacks. Existing attacks are often impractical: they either noticeably degrade perceptual quality or require access to the watermarking scheme. We propose DiffErase, a black-box watermark removal attack that assumes no knowledge of the target watermarking scheme while maintaining perceptual quality. DiffErase perturbs watermarked audio to an intermediate diffusion noise level and regenerates it using a pretrained denoising model, effectively suppressing watermark signals. Theoretical analysis and extensive experiments demonstrate that inaudible audio watermarks are highly vulnerable: across multiple audio domains, DiffErase consistently removes watermarks while preserving perceptual quality. These findings highlight the need for future audio watermarking designs to consider diffusion-based threats. Code and demos are available at https://differase.github.io/DiffErase/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes DiffErase, a black-box watermark removal attack on audio signals. It perturbs watermarked audio to an intermediate diffusion timestep and regenerates via a pretrained denoising model to suppress the watermark signal without access to the watermarking scheme or noticeable quality loss. Theoretical analysis and experiments across audio domains are claimed to show consistent watermark removal while preserving perceptual quality, highlighting vulnerabilities in existing watermarking designs.

Significance. If the central claim holds, the work identifies a practical diffusion-based threat to audio watermarking that is more general than prior attacks, which could drive more robust watermark designs for AI-generated content. The public release of code and demos is a positive factor for reproducibility.

major comments (2)
  1. [Method] Method section (diffusion perturbation step): the selection of the intermediate timestep t is not shown to be independent of the watermarking scheme or domain. If t is chosen via validation on the evaluated watermarks (or per-domain heuristics), the black-box guarantee fails for unseen schemes, directly undermining the claim that DiffErase 'consistently removes watermarks while preserving perceptual quality' without scheme access.
  2. [Experiments] Experiments (quantitative results): the abstract and claimed analysis assert support via metrics such as PESQ/STOI, but no error bars, dataset sizes, or statistical significance tests are referenced in the provided description; without these, the 'extensive experiments' do not yet establish robustness across domains.
minor comments (1)
  1. [Abstract] Abstract: no quantitative metrics or dataset details are provided despite claiming 'extensive experiments'; adding a sentence with key numbers (e.g., average detection rate drop, PESQ scores) would improve clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. We address the two major comments point by point below. Where revisions are warranted we indicate them explicitly; we believe the core claims remain supported by the theoretical analysis and cross-domain results.

read point-by-point responses
  1. Referee: [Method] Method section (diffusion perturbation step): the selection of the intermediate timestep t is not shown to be independent of the watermarking scheme or domain. If t is chosen via validation on the evaluated watermarks (or per-domain heuristics), the black-box guarantee fails for unseen schemes, directly undermining the claim that DiffErase 'consistently removes watermarks while preserving perceptual quality' without scheme access.

    Authors: The timestep t is determined from the diffusion prior itself: we select the noise level at which the theoretical analysis shows watermark signals are attenuated while semantic content remains recoverable by the pretrained denoiser. This choice is fixed once the diffusion model is chosen and does not involve per-watermark validation or domain-specific heuristics. Experiments on multiple unseen watermarking schemes and audio domains were performed after fixing t, supporting the black-box claim. To make this explicit we will add a dedicated paragraph in the method section describing the selection criterion and its independence from any watermarking scheme. revision: partial

  2. Referee: [Experiments] Experiments (quantitative results): the abstract and claimed analysis assert support via metrics such as PESQ/STOI, but no error bars, dataset sizes, or statistical significance tests are referenced in the provided description; without these, the 'extensive experiments' do not yet establish robustness across domains.

    Authors: We agree that reporting error bars, exact dataset sizes per domain, and statistical significance tests would strengthen the presentation. In the revised manuscript we will include these details: standard deviations across runs, the number of audio clips evaluated in each domain, and p-values from paired statistical tests comparing watermarked and DiffErase-processed outputs. revision: yes

Circularity Check

0 steps flagged

No circularity; method relies on external pretrained models

full rationale

The paper describes an empirical black-box attack (DiffErase) that perturbs audio to an intermediate diffusion timestep and denoises with a pretrained model. No equations, derivations, or first-principles results are presented that reduce the claimed outcome to fitted parameters, self-definitions, or self-citation chains. The central claim rests on external pretrained diffusion models and experimental validation across domains, with no load-bearing self-referential steps or predictions that are forced by construction. The theoretical analysis is invoked but not shown to collapse into the inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The approach rests on the standard capability of pretrained diffusion models rather than new fitted parameters or invented entities.

axioms (1)
  • domain assumption Pretrained diffusion denoising models trained on clean audio can regenerate natural-sounding audio from intermediate noise levels
    The regeneration step assumes the model was trained on distributions that exclude watermark signals.

pith-pipeline@v0.9.1-grok · 5718 in / 1119 out tokens · 22614 ms · 2026-06-29T06:17:36.225270+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

23 extracted references · 17 canonical work pages · 7 internal anchors

  1. [1]

    Wavmark: Watermarking for audio generation.arXiv preprint arXiv:2308.12770,

    Chen, G., Wu, Y ., Liu, S., Liu, T., Du, X., and Wei, F. Wavmark: Watermarking for audio generation.arXiv preprint arXiv:2308.12770,

  2. [2]

    FMA: A Dataset For Music Analysis

    Defferrard, M., Benzi, K., Vandergheynst, P., and Bresson, X. Fma: A dataset for music analysis.arXiv preprint arXiv:1612.01840,

  3. [3]

    High Fidelity Neural Audio Compression

    D´efossez, A., Copet, J., Synnaeve, G., and Adi, Y . High fidelity neural audio compression.arXiv preprint arXiv:2210.13438,

  4. [4]

    Clotho: An audio captioning dataset

    Drossos, K., Lipping, S., and Virtanen, T. Clotho: An audio captioning dataset. InICASSP 2020-2020 IEEE Inter- national Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 736–740. IEEE,

  5. [5]

    Adversarial Spheres

    Gilmer, J., Metz, L., Faghri, F., Schoenholz, S. S., Raghu, M., Wattenberg, M., and Goodfellow, I. Adversarial spheres.arXiv preprint arXiv:1801.02774,

  6. [6]

    I., Zhang, L

    Huang, C., Yao, L., Lee, K. I., Zhang, L. E., Chen, X., and Pan, M. Echomark: Perceptual acoustic environ- ment transfer with watermark-embedded room impulse response.arXiv preprint arXiv:2511.06458,

  7. [7]

    DiffWave: A Versatile Diffusion Model for Audio Synthesis

    Kong, J., Kim, J., and Bae, J. Hifi-gan: Generative ad- versarial networks for efficient and high fidelity speech synthesis.Advances in neural information processing systems, 33:17022–17033, 2020a. Kong, Z., Ping, W., Huang, J., Zhao, K., and Catanzaro, B. Diffwave: A versatile diffusion model for audio synthesis. arXiv preprint arXiv:2009.09761, 2020b. K...

  8. [8]

    G., Ping, W., Ginsburg, B., Catanzaro, B., and Yoon, S

    Lee, S. G., Ping, W., Ginsburg, B., Catanzaro, B., and Yoon, S. Bigvgan: A universal neural vocoder with large-scale training. In11th International Conference on Learning Representations, ICLR 2023,

  9. [9]

    HarmonicAttack: An Adaptive Cross-Domain Audio Watermark Removal

    Li, K., Hu, X., Grishchenko, I., and Lie, D. Harmoni- cattack: An adaptive cross-domain audio watermark re- moval.arXiv preprint arXiv:2511.21577,

  10. [10]

    Ideaw: Robust neu- ral audio watermarking with invertible dual-embedding

    Li, P., Zhang, X., Xiao, J., and Wang, J. Ideaw: Robust neu- ral audio watermarking with invertible dual-embedding. arXiv preprint arXiv:2409.19627,

  11. [11]

    Dear: A deep-learning-based audio re-recording resilient watermarking

    Liu, C., Zhang, J., Fang, H., Ma, Z., Zhang, W., and Yu, N. Dear: A deep-learning-based audio re-recording resilient watermarking. InProceedings of the AAAI Conference on Artificial Intelligence, volume 37, pp. 13201–13209, 2023a. Liu, C., Zhang, J., Zhang, T., Yang, X., Zhang, W., and Yu, N. Detecting voice cloning attacks via timbre watermark- ing.arXiv...

  12. [12]

    SDEdit: Guided Image Synthesis and Editing with Stochastic Differential Equations

    Meng, C., He, Y ., Song, Y ., Song, J., Wu, J., Zhu, J.-Y ., and Ermon, S. Sdedit: Guided image synthesis and edit- ing with stochastic differential equations.arXiv preprint arXiv:2108.01073,

  13. [13]

    Deep audio water- marks are shallow: Limitations of post-hoc watermarking techniques for speech.arXiv preprint arXiv:2504.10782,

    O’Reilly, P., Jin, Z., Su, J., and Pardo, B. Deep audio water- marks are shallow: Limitations of post-hoc watermarking techniques for speech.arXiv preprint arXiv:2504.10782,

  14. [14]

    com/resemble-ai/Perth

    URL https://github. com/resemble-ai/Perth. Roman, R. S., Fernandez, P., D ´efossez, A., Furon, T., Tran, T., and Elsahar, H. Proactive detection of voice cloning with localized watermarking.arXiv preprint arXiv:2401.17264,

  15. [15]

    K., Takahashi, N., Liao, W., and Mitsufuji, Y

    Singh, M. K., Takahashi, N., Liao, W., and Mitsufuji, Y . Silentcipher: Deep audio watermarking.arXiv preprint arXiv:2406.03822,

  16. [16]

    Denoising Diffusion Implicit Models

    Song, J., Meng, C., and Ermon, S. Denoising diffusion implicit models.arXiv preprint arXiv:2010.02502,

  17. [17]

    B., Guo, H., and Yan, Q

    Wen, Y ., Innuganti, A., Ramos, A. B., Guo, H., and Yan, Q. Sok: How robust is audio watermarking in generative ai models?arXiv preprint arXiv:2503.19176,

  18. [18]

    Defend- ing against adversarial audio via diffusion model.arXiv preprint arXiv:2303.01507,

    Wu, S., Wang, J., Ping, W., Nie, W., and Xiao, C. Defend- ing against adversarial audio via diffusion model.arXiv preprint arXiv:2303.01507,

  19. [19]

    Fingerprinting deep neural networks for ownership protection: An ana- lytical approach

    Yang, G., Geng, Z., Chen, Y ., and Luo, C. Fingerprinting deep neural networks for ownership protection: An ana- lytical approach. InThe Fourteenth International Confer- ence on Learning Representations, 2026a. URL https: //openreview.net/forum?id=sg3UNWKVFt. Yang, G., Geng, Z., Chen, Y ., and Luo, C. Liteguard: Effi- cient task-agnostic model fingerprint...

  20. [20]

    Proofs in Section 4.3 A.1

    11 Black-box Audio Watermark Removal via Diffusion Priors A. Proofs in Section 4.3 A.1. Proof of Lemma 4.3 Lemma A.1(Score restores off-manifold deviations).Assume the manifold hypothesis and a local Gaussian approximation of pt, the score function points towards the high-density region. For a watermarked statext with off-manifold component Π⊥(xt)δ, there...

  21. [21]

    The reconstructed mel-spectrogram is converted back to waveform using HiFi-GAN (Kong et al., 2020a)

    Diffusion is then performed in the V AE latent space using a UNet with the following configuration: image size 64, base channels 128, 2 residual blocks per stage, channel multipliers [1,2,3,5] , and attention at resolutions {8,4,2} . The reconstructed mel-spectrogram is converted back to waveform using HiFi-GAN (Kong et al., 2020a). B.2. Baseline attacks ...

  22. [22]

    in the spectrogram domain following Liu et al. (2024a). For speech, we use a query budget of 10,000 and perturbation bound ϵ= 0.02 . For music and environmental sounds, we foundϵ= 0.02insufficient for successful attacks and increased it toϵ= 0.2. 14 Black-box Audio Watermark Removal via Diffusion Priors Table 4.Comparison with baselineson the music domain...

  23. [23]

    Noise level trade-off for DIFFERASE-LATENT Figure 5 shows the quality-removal trade-off for DIFFERASE-LATENT

    C.2. Noise level trade-off for DIFFERASE-LATENT Figure 5 shows the quality-removal trade-off for DIFFERASE-LATENT. The trends are similar: increasing t∗ improves watermark removal (higher FNR) while reducing audio quality (lower ViSQOL). On speech, Perth becomes undetectable at t∗ ≥0.05 . On music and environmental sounds, complete removal requires higher...