
arxiv: 2502.04230 · v3 · pith:KOF4TXN4new · submitted 2025-02-06 · 💻 cs.SD · cs.AI · cs.CR · cs.LG · eess.AS

XAttnMark: Learning Robust Audio Watermarking with Cross-Attention

Pith reviewed 2026-05-25 08:28 UTC · model grok-4.3

classification 💻 cs.SD · cs.AI · cs.CR · cs.LG · eess.AS
keywords audio watermarking · cross-attention · robust detection · attribution · generative audio · psychoacoustic masking · parameter sharing

The pith

XATTNMARK uses cross-attention and partial parameter sharing to jointly optimize audio watermark detection and attribution under generative edits.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces XATTNMARK to address the difficulty of making watermarks that are both easy to detect and easy to trace back to their source in generative audio. Prior methods improve one goal at the expense of the other, but this approach shares some parameters between the generator that embeds the signal and the detector that reads it, then adds a cross-attention step to pull out the embedded message efficiently. A temporal conditioning module spreads the message across time, while a new loss term based on time-frequency masking keeps the watermark from being audible. If the method works as described, it would let audio platforms and rights holders reliably identify and attribute content even after heavy editing or synthesis transformations.

Core claim

XATTNMARK bridges the gap between robust detection and accurate attribution by leveraging partial parameter sharing between the generator and the detector, a cross-attention mechanism for efficient message retrieval, a temporal conditioning module for improved message distribution, and a psychoacoustic-aligned time-frequency masking loss that captures fine-grained auditory masking effects, achieving state-of-the-art performance in both detection and attribution with superior robustness against a wide range of audio transformations including challenging generative editing at varying strengths.

What carries the argument

Cross-attention mechanism for message retrieval combined with partial parameter sharing between generator and detector.
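
To make the retrieval idea concrete, here is a minimal NumPy sketch; it is an illustration under stated assumptions, not the paper's implementation. It assumes a shared table holding one embedding per (bit position, bit value), a generator that simply adds the selected embeddings to the latent, and a detector that pools its features into one query and scores that query against the same table with scaled dot-product attention. The function names, the 16-bit message length, the orthonormal table, and the additive embedding are all hypothetical choices made to keep the example self-contained.

```python
import numpy as np


def make_shared_table(n_bits, dim, rng):
    """Build a (n_bits, 2, dim) table with orthonormal rows (assumed init)."""
    q, _ = np.linalg.qr(rng.standard_normal((dim, 2 * n_bits)))
    return q.T.reshape(n_bits, 2, dim)


def embed_message(table, bits):
    """Generator side (assumed): sum the embedding selected by each bit."""
    return table[np.arange(len(bits)), bits].sum(axis=0)


def cross_attention_decode(table, features):
    """Detector side (assumed): pooled features attend over the shared table.

    features: (T, dim) array of per-frame detector features.
    Returns the decoded bits and the per-bit attention weights.
    """
    query = features.mean(axis=0)                        # (dim,)
    scores = table @ query / np.sqrt(table.shape[-1])    # (n_bits, 2)
    scores = scores - scores.max(axis=1, keepdims=True)  # stable softmax
    attn = np.exp(scores)
    attn = attn / attn.sum(axis=1, keepdims=True)
    return attn.argmax(axis=1), attn


rng = np.random.default_rng(0)
table = make_shared_table(n_bits=16, dim=64, rng=rng)
message = rng.integers(0, 2, size=16)
# Stand-in for detector features: message embedding plus per-frame noise.
features = embed_message(table, message) + 0.05 * rng.standard_normal((50, 64))
decoded, weights = cross_attention_decode(table, features)
print("bit accuracy:", float((decoded == message).mean()))
```

Because generator and detector index the same table, the detector needs no separate copy of the message codebook, which is one plausible reading of how partial parameter sharing helps message retrieval.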

If this is right

  • XATTNMARK demonstrates superior robustness against generative editing at varying strengths.
  • The psychoacoustic-aligned TF masking loss improves watermark imperceptibility.
  • Partial parameter sharing enables joint optimization of detection and attribution.
  • The approach advances audio watermarking for protecting intellectual property and ensuring authenticity in generative AI.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • If the cross-attention design proves stable, it could be adapted to watermark other time-series data such as video or sensor streams.
  • Deployment at scale would likely require additional checks against removal attacks that target the shared parameters specifically.
  • Integration into commercial audio generators might create a de-facto standard for provenance tracking without separate post-processing steps.

Load-bearing premise

The cross-attention mechanism together with partial parameter sharing will jointly optimize robust detection and accurate attribution without introducing new failure modes under real-world generative editing pipelines.

What would settle it

A controlled test in which audio is passed through strong generative editing models at multiple strength levels and either watermark detection rate or attribution accuracy falls below the levels reported for WavMark or AudioSeal.
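
The comparison described above can be sketched as a small bookkeeping helper. This is a hypothetical harness, not something the paper provides: the `Score` type, the function name, and the dictionary layout are assumptions, and the numbers in the usage lines are placeholders rather than measurements.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Score:
    detection_rate: float    # fraction of watermarked clips detected
    attribution_acc: float   # fraction attributed to the correct user


def strengths_where_claim_fails(ours, baselines):
    """List (strength, baseline) pairs where our score is below the baseline.

    ours:      {edit_strength: Score}
    baselines: {baseline_name: {edit_strength: Score}}
    """
    failures = []
    for strength in sorted(ours):
        mine = ours[strength]
        for name in sorted(baselines):
            ref = baselines[name][strength]
            if (mine.detection_rate < ref.detection_rate
                    or mine.attribution_acc < ref.attribution_acc):
                failures.append((strength, name))
    return failures


# PLACEHOLDER values for illustration only; they are not reported results.
ours = {0.2: Score(0.9, 0.8), 0.8: Score(0.6, 0.3)}
baselines = {"baseline_a": {0.2: Score(0.8, 0.7), 0.8: Score(0.7, 0.2)}}
print(strengths_where_claim_fails(ours, baselines))
```

An empty list would mean the claim survives at every tested strength; any entry names a strength and baseline at which it does not.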

Figures

Figures reproduced from arXiv: 2502.04230 by Andrea Fanelli, Jihui Jin, Lichao Sun, Lie Lu, Yixin Liu.

Figure 1: Quality-attribution performance trade-off curve across different watermarking strengths and the overall performance comparison on detection and attribution tasks. Higher values on both axes indicate better performance.
Figure 2: System Diagram for XATTNMARK. XATTNMARK consists of a watermark generator and a watermark detector, with a shared embedding table that facilitates message decoding through a cross-attention module. In the generator part, we first employ an encoder network to encode the audio latent and then apply a temporal modulation to hide the message. The modulated latent is then fed into a decoder to produce the water…
Figure 3: Attribution accuracy with different #Users.
Adjacent plot (caption not recovered): message bit accuracy (%) versus training steps (k) for Ours, w/o Modulation, w/o CrossAttn, and w/o Both.
Figure 5: MUSHRA subjective listening test results comparing perceptual quality across different watermarking methods. Higher scores indicate better audio quality as rated by human listeners. Our method achieves quality scores comparable to AudioSeal.
Figure 6: Validation accuracy and quality curve of different methods over training steps. Using a blended architecture, XATTNMARK is able to achieve a better balance in terms of learning efficiency in detection, and message-bit decoding and watermark imperceptibility. Compared to our method, WavMark, the fully-shared architecture, suffers from robustness degradation in detection as the training progresses; the fully…
Figure 7: Detection and Attribution Accuracy of our method on augmented samples across different augmentation strengths.
Figure 8: The detection and attribution accuracy of XATTNMARK across different audio duration and watermark strength α on MusicCaps. The performance is averaged over all the standard audio editing. The attribution pool size is set to {100, 1000, 10000}.
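
To illustrate what an attribution pool means in practice, the sketch below assigns each user a distinct binary message and attributes a decoded message to the nearest user by Hamming distance. This is a common decoding rule and an assumption here; the 16-bit message length, the function names, and the nearest-neighbour rule are hypothetical and are not taken from the paper.

```python
import numpy as np


def make_user_pool(n_users, n_bits, rng):
    """Assign each user a distinct n_bits message; returns (n_users, n_bits)."""
    ids = rng.choice(2 ** n_bits, size=n_users, replace=False)
    return (ids[:, None] >> np.arange(n_bits)) & 1


def attribute(decoded_bits, pool):
    """Return (user index, Hamming distance) of the closest pool message."""
    dists = np.count_nonzero(pool != decoded_bits, axis=1)
    best = int(np.argmin(dists))
    return best, int(dists[best])


rng = np.random.default_rng(0)
for n_users in (100, 1000, 10000):
    pool = make_user_pool(n_users, n_bits=16, rng=rng)
    noisy = pool[7].copy()
    noisy[0] ^= 1  # simulate one bit decoded incorrectly after editing
    user, dist = attribute(noisy, pool)
    print(n_users, "users -> attributed to", user, "at distance", dist)
```

With a fixed message length, a larger pool packs user messages closer together, so a single bit error is more likely to land on or next to another user's message. That is one reason attribution accuracy is typically reported per pool size.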
Figure 9: Visualization of the original audio and watermark residuals across four different methods for a randomly selected sample from MusicCaps dataset, showing waveform (left), spectrogram (middle), and mel-spectrogram (right) representations.
Figure 10: Visualization of the per-tile masking penalty distribution. The blue regions indicate tiles assigned a lower penalty for embedding a watermark. Our weighting scheme provides more fine-grained control. The audio sample is randomly selected from MusicCaps (Agostinelli et al., 2023), and the watermarked audio is generated from a model at the very beginning of training.
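
The per-tile penalty idea can be sketched as follows. This is a simplified stand-in, not the paper's loss: it assumes a plain STFT, a fixed grid of 16 frequency bands, and an inverse-energy weight in place of a real psychoacoustic masking model. All function names and parameters are hypothetical.

```python
import numpy as np


def stft_power(x, n_fft=256, hop=128):
    """Power spectrogram of shape (frames, n_fft // 2 + 1), Hann window."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop:i * hop + n_fft] * window
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1)) ** 2


def tile_energy(power, n_bands=16, frames_per_tile=4):
    """Sum power over time-frequency tiles; returns (time tiles, n_bands)."""
    n_frames, n_bins = power.shape
    n_t = n_frames // frames_per_tile
    bins_per_band = (n_bins - 1) // n_bands  # drop the Nyquist bin
    p = power[:n_t * frames_per_tile, :n_bands * bins_per_band]
    p = p.reshape(n_t, frames_per_tile, n_bands, bins_per_band)
    return p.sum(axis=(1, 3))


def masking_penalty(host, residual, eps=1e-8):
    """Weighted share of residual energy; loud host tiles get low weight."""
    host_e = tile_energy(stft_power(host))
    res_e = tile_energy(stft_power(residual))
    # Assumed weighting: penalty shrinks as host energy in the tile grows.
    weights = 1.0 / (1.0 + host_e / (host_e.mean() + eps))
    penalty = float((weights * res_e).sum() / (res_e.sum() + eps))
    return penalty, weights


sr = 16000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 1250 * t)
host = tone.copy()
host[sr // 2:] = 0.0                       # loud first half, silent second
masked = np.zeros(sr)
masked[:sr // 2] = 0.01 * tone[:sr // 2]   # residual hidden under the host
exposed = np.zeros(sr)
exposed[sr // 2:] = 0.01 * tone[sr // 2:]  # residual placed in silence
print("masked penalty :", masking_penalty(host, masked)[0])
print("exposed penalty:", masking_penalty(host, exposed)[0])
```

A residual placed in the same tiles as a loud host receives a much smaller penalty than the same residual placed in silence, which is the behaviour the lower-penalty tiles in the figure describe.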
Original abstract

The rapid proliferation of generative audio synthesis and editing technologies has raised serious concerns about copyright infringement, data provenance, and the spread of misinformation via deepfake audio. Watermarking offers a proactive solution by embedding imperceptible yet identifiable and traceable signals into audio content. While recent neural network-based watermarking methods like WavMark and AudioSeal have improved robustness and quality, they struggle to jointly optimize both robust detection and accurate attribution. This paper introduces Cross-Attention Robust Audio Watermark (XATTNMARK), which bridges this gap by leveraging partial parameter sharing between the generator and the detector, a cross-attention mechanism for efficient message retrieval, and a temporal conditioning module for improved message distribution. Additionally, we propose a psychoacoustic-aligned time-frequency (TF) masking loss that captures fine-grained auditory masking effects, improving watermark imperceptibility. XATTNMARK achieves state-of-the-art performance in both detection and attribution, demonstrating superior robustness against a wide range of audio transformations, including challenging generative editing at varying strengths. This work advances audio watermarking for protecting intellectual property and ensuring authenticity in the era of generative AI.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The manuscript introduces XAttnMark, a neural audio watermarking approach that uses partial parameter sharing between the generator and detector, a cross-attention mechanism for message retrieval, a temporal conditioning module, and a psychoacoustic-aligned time-frequency (TF) masking loss. It claims state-of-the-art performance on both detection and attribution tasks, with improved robustness to a range of audio transformations including generative editing at varying strengths, addressing limitations in prior methods such as WavMark and AudioSeal.

Significance. If the empirical claims hold, the work would advance audio watermarking by demonstrating joint optimization of detection and attribution under realistic generative edits, which is a practically relevant gap. The cross-attention design and TF masking loss represent targeted technical contributions that could improve message recoverability and imperceptibility.

major comments (2)
  1. [Abstract] Abstract: the central SOTA claim for joint detection+attribution robustness is stated without any quantitative results, baselines, datasets, ablations, or error bars, rendering it impossible to verify whether the cross-attention and partial-sharing design actually supports the performance assertions or whether post-hoc choices affect the outcome.
  2. [Abstract] Abstract: the assumption that cross-attention message retrieval together with partial generator-detector parameter sharing and temporal conditioning will avoid new failure modes under high-strength generative edits (e.g., diffusion-based or voice-conversion pipelines) is untested in the visible text; no ablation isolating the effect of shared parameters on attribution accuracy after such edits is provided, which is load-bearing for the robustness claim.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful and constructive comments. We address each major comment below and indicate where revisions will be made to the manuscript.

  1. Referee: [Abstract] Abstract: the central SOTA claim for joint detection+attribution robustness is stated without any quantitative results, baselines, datasets, ablations, or error bars, rendering it impossible to verify whether the cross-attention and partial-sharing design actually supports the performance assertions or whether post-hoc choices affect the outcome.

    Authors: We agree that the abstract would benefit from including key quantitative results to make the SOTA claim more verifiable on its own. The full manuscript contains the supporting experiments with baselines (WavMark, AudioSeal), datasets, ablations, and error bars in the Experiments section. We will revise the abstract to incorporate specific metrics on detection and attribution robustness. revision: yes

  2. Referee: [Abstract] Abstract: the assumption that cross-attention message retrieval together with partial generator-detector parameter sharing and temporal conditioning will avoid new failure modes under high-strength generative edits (e.g., diffusion-based or voice-conversion pipelines) is untested in the visible text; no ablation isolating the effect of shared parameters on attribution accuracy after such edits is provided, which is load-bearing for the robustness claim.

    Authors: The manuscript reports evaluations of robustness under generative editing at varying strengths. We acknowledge, however, that an explicit ablation isolating the contribution of partial parameter sharing to attribution accuracy specifically after high-strength edits is not present. We will add this ablation to strengthen the evidence for the design choices. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical ML claims with no derivation chain or self-referential fitting

The provided abstract and context describe an empirical neural audio watermarking model (cross-attention + partial parameter sharing + TF masking loss) whose central claims are SOTA detection/attribution numbers on transformed audio. No equations, first-principles derivations, fitted parameters renamed as predictions, or self-citation chains appear in the text. Performance results are obtained by standard training/evaluation and do not reduce to inputs by construction. This is the normal non-circular case for a learning-based method paper.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only abstract available; no free parameters, axioms, or invented entities are described in sufficient detail to populate the ledger.

pith-pipeline@v0.9.0 · 5745 in / 1224 out tokens · 18898 ms · 2026-05-25T08:28:32.400374+00:00 · methodology


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. MelShield: Robust Mel-Domain Audio Watermarking for Provenance Attribution of AI Generated Synthesized Speech

    cs.SD · 2026-05 · unverdicted · novelty 7.0

    MelShield adds keyed low-energy spread-spectrum perturbations to Mel-spectrograms inside TTS pipelines before vocoding to enable robust extraction of user-specific attribution signals even after compression or noise.

Reference graph

Works this paper leans on

68 extracted references · 68 canonical work pages · cited by 1 Pith paper · 7 internal anchors

  3. [3]

    Abbott, R. and Rothman, E. Disrupting creativity: Copyright law in the age of generative artificial intelligence. Fla. L. Rev., 75:1141, 2023

  4. [4]

    Agnew, W., Barnett, J., Chu, A., Hong, R., Feffer, M., Netzorg, R., Jiang, H. H., Awumey, E., and Das, S. Sound check: Auditing audio datasets. arXiv preprint 2410.13114, 2024. URL http://arxiv.org/pdf/2410.13114v1

  5. [5]

    MusicLM: Generating Music From Text

    Agostinelli, A., Denk, T. I., Borsos, Z., Engel, J., Verzetti, M., Caillon, A., Huang, Q., Jansen, A., Roberts, A., Tagliasacchi, M., et al. Musiclm: Generating music from text. arXiv preprint arXiv:2301.11325, 2023

  6. [6]

    Ardila, R., Branson, M., Davis, K., Henretty, M., Kohler, M., Meyer, J., Morais, R., Saunders, L., Tyers, F. M., and Weber, G. Common voice: A massively-multilingual speech corpus. In LREC, 2020

  7. [7]

    Barnett, J., Garcia, H. F., and Pardo, B. Exploring musical roots: Applying audio embeddings to empower influence attribution for a generative music model. arXiv preprint 2401.14542, 2024. URL http://arxiv.org/pdf/2401.14542v1

  8. [8]

    Hello me, meet the real me: Audio deepfake attacks on voice assistants

    Bilika, D., Michopoulou, N., Alepis, E., and Patsakis, C. Hello me, meet the real me: Audio deepfake attacks on voice assistants. arXiv preprint 2302.10328, 2023. URL http://arxiv.org/pdf/2302.10328v1

  9. [9]

    ISO/IEC MPEG-2 advanced audio coding

    Bosi, M., Brandenburg, K., Quackenbush, S., Fielder, L., Akagiri, K., Fuchs, H., and Dietz, M. ISO/IEC MPEG-2 advanced audio coding. Journal of the Audio Engineering Society, 45(10):789–814, 1997

  10. [10]

    violation of my body:

    Brigham, N. G., Wei, M., Kohno, T., and Redmiles, E. M. "violation of my body:" perceptions of ai-generated non-consensual (intimate) imagery. arXiv preprint 2406.05520, 2024. URL http://arxiv.org/pdf/2406.05520v2

  11. [11]

    Buo, S. A. The emerging threats of deepfake attacks and countermeasures. arXiv preprint 2012.07989, 2020. URL http://arxiv.org/pdf/2012.07989v1

  12. [12]

    Wavmark: Watermarking for audio generation

    Chen, G., Wu, Y., Liu, S., Liu, T., Du, X., and Wei, F. Wavmark: Watermarking for audio generation. arXiv preprint arXiv:2308.12770, 2023

  13. [13]

    Chen, J., Jordan, M. I., and Wainwright, M. J. Hopskipjumpattack: A query-efficient decision-based attack. In S&P, 2020

  14. [14]

    Artists score major win in copyright case against ai art generators

    Cho, W. Artists score major win in copyright case against ai art generators. The Hollywood Reporter, August 2024

  15. [15]

    Fast and Accurate Deep Network Learning by Exponential Linear Units (ELUs)

    Clevert, D.-A. Fast and accurate deep network learning by exponential linear units (elus). arXiv preprint arXiv:1511.07289, 2015

  16. [16]

    Simple and controllable music generation

    Copet, J., Kreuk, F., Gat, I., Remez, T., Kant, D., Synnaeve, G., Adi, Y., and Défossez, A. Simple and controllable music generation. In Thirty-seventh Conference on Neural Information Processing Systems, 2023

  17. [17]

    Simple and controllable music generation

    Copet, J., Kreuk, F., Gat, I., Remez, T., Kant, D., Synnaeve, G., Adi, Y., and Défossez, A. Simple and controllable music generation. Advances in Neural Information Processing Systems, 36, 2024

  18. [18]

    FMA: A Dataset For Music Analysis

    Defferrard, M., Benzi, K., Vandergheynst, P., and Bresson, X. Fma: A dataset for music analysis. arXiv preprint arXiv:1612.01840, 2016

  19. [19]

    High Fidelity Neural Audio Compression

    Defossez, A., Copet, J., Synnaeve, G., and Adi, Y. High fidelity neural audio compression. arXiv preprint arXiv:2210.13438, 2022

  20. [20]

    Desai, D. R. and Riedl, M. Between copyright and computer science: The law and ethics of generative ai. arXiv preprint 2403.14653, 2024. URL http://arxiv.org/pdf/2403.14653v2

  21. [21]

    Sok: Dataset copyright auditing in machine learning systems

    Du, L., Zhou, X., Chen, M., Zhang, C., Su, Z., Cheng, P., Chen, J., and Zhang, Z. Sok: Dataset copyright auditing in machine learning systems. arXiv preprint 2410.16618, 2024. URL http://arxiv.org/pdf/2410.16618v1

  22. [22]

    Evans, Z., Parker, J. D., Carr, C. J., Zukowski, Z., Taylor, J., and Pons, J. Stable audio open. arXiv preprint arXiv:2407.14358, 2024

  23. [23]

    Gelfand, S. A. Hearing: An introduction to psychological and physiological acoustics. CRC Press, 2017

  24. [24]

    Gemmeke, J. F., Ellis, D. P., Freedman, D., Jansen, A., Lawrence, W., Moore, R. C., Plakal, M., and Ritter, M. Audio set: An ontology and human-labeled dataset for audio events. In 2017 IEEE international conference on acoustics, speech and signal processing (ICASSP), pp. 776–780. IEEE, 2017

  25. [25]

    Visqol: The virtual speech quality objective listener

    Hines, A., Skoglund, J., Kokaram, A., and Harte, N. Visqol: The virtual speech quality objective listener. In IWAENC 2012; International Workshop on Acoustic Signal Enhancement, pp. 1–4. VDE, 2012

  26. [26]

    Implementing a gammatone filter bank

    Holdsworth, J., Nimmo-Smith, I., Patterson, R., and Rice, P. Implementing a gammatone filter bank. Annex C of the SVOS Final Report: Part A: The Auditory Filterbank, 1:1–5, 1988

  27. [27]

    Hu, Y., Ma, M., Lu, W., Xiong, N. N., and Wei, J. Selection of the optimal embedding positions of digital audio watermarking in wavelet domain. arXiv preprint 2010.11461, 2020. URL http://arxiv.org/pdf/2010.11461v1

  28. [28]

    Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014

  29. [29]

    Quality-aware masked diffusion transformer for enhanced music generation

    Li, C., Wang, R., Liu, L., Du, J., Sun, Y., Guo, Z., Zhang, Z., and Jiang, Y. Quality-aware masked diffusion transformer for enhanced music generation. arXiv preprint arXiv:2405.15863, 2024

  30. [30]

    Detecting voice cloning attacks via timbre watermarking

    Liu, C., Zhang, J., Zhang, T., Yang, X., Zhang, W., and Yu, N. Detecting voice cloning attacks via timbre watermarking. arXiv preprint 2312.03410, 2023a. URL http://arxiv.org/pdf/2312.03410v1

  31. [31]

    Liu, H., Guo, M., Jiang, Z., Wang, L., and Gong, N. Z. Audiomarkbench: Benchmarking robustness of audio watermarking. arXiv preprint 2406.06979, 2024a. URL http://arxiv.org/pdf/2406.06979v1

  32. [32]

    Liu, H., Yuan, Y., Liu, X., Mei, X., Kong, Q., Tian, Q., Wang, Y., Wang, W., Wang, Y., and Plumbley, M. D. Audioldm 2: Learning holistic audio generation with self-supervised pretraining. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2024b

  33. [33]

    Asvspoof 2021: Towards spoofed and deepfake speech detection in the wild

    Liu, X., Wang, X., Sahidullah, M., Patino, J., Delgado, H., Kinnunen, T., Todisco, M., Yamagishi, J., Evans, N., Nautsch, A., et al. Asvspoof 2021: Towards spoofed and deepfake speech detection in the wild. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2023b

  34. [34]

    Metacloak: Preventing unauthorized subject-driven text-to-image diffusion-based synthesis via meta-learning

    Liu, Y., Fan, C., Dai, Y., Chen, X., Zhou, P., and Sun, L. Metacloak: Preventing unauthorized subject-driven text-to-image diffusion-based synthesis via meta-learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 24219–24228, June 2024c

  35. [35]

    Manor, H. and Michaeli, T. Zero-shot unsupervised and text-based audio editing using ddpm inversion. arXiv preprint 2402.10009, 2024. URL http://arxiv.org/pdf/2402.10009v4

  36. [36]

    Auditory time-frequency masking for spectrally and temporally maximally-compact stimuli

    Necciari, T., Laback, B., Savel, S., Ystad, S., Balazs, P., Meunier, S., and Kronland-Martinet, R. Auditory time-frequency masking for spectrally and temporally maximally-compact stimuli. PloS one, 11(11):e0166937, Nov 2016

  37. [37]

    Office, U. C. Copyright and artificial intelligence: Part 1 – digital replicas report, 2023. URL https://www.copyright.gov/ai/Copyright-and-Artificial-Intelligence-Part-1-Digital-Replicas-Report.pdf. Accessed: 2025-01-23

  38. [38]

    Sora: Creating video from text

    OpenAI. Sora: Creating video from text. https://openai.com/sora, 2024

  39. [39]

    Pan, Y., Pan, L., Chen, W., Nakov, P., Kan, M.-Y., and Wang, W. Y. On the risk of misinformation pollution with large language models. arXiv preprint 2305.13661, 2023. URL http://arxiv.org/pdf/2305.13661v2

  40. [40]

    Librispeech: an asr corpus based on public domain audio books

    Panayotov, V., Chen, G., Povey, D., and Khudanpur, S. Librispeech: an asr corpus based on public domain audio books. In ICASSP, 2015

  41. [41]

    Park, P. S., Goldstein, S., O'Gara, A., Chen, M., and Hendrycks, D. Ai deception: A survey of examples, risks, and potential solutions. arXiv preprint 2308.14752, 2023. URL http://arxiv.org/pdf/2308.14752v1

  42. [42]

    Peebles, W. and Xie, S. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4195–4205, 2023

  43. [43]

    A lattice-based embedding method for reversible audio watermarking

    Qin, J., Lyu, S., Deng, J., Liang, X., Xiang, S., and Chen, H. A lattice-based embedding method for reversible audio watermarking. arXiv preprint 2209.07066, 2022. URL http://arxiv.org/pdf/2209.07066v1

  44. [44]

    Qiwei, L., Zhang, S., Kasper, A. T., Ashkinaze, J., Eaton, A. A., Schoenebeck, S., and Gilbert, E. Reporting non-consensual intimate media: An audit study of deepfakes. arXiv preprint 2409.12138, 2024. URL http://arxiv.org/pdf/2409.12138v1

  45. [45]

    Rafii, Z., Liutkus, A., Stöter, F.-R., Mimilakis, S. I., and Bittner, R. The MUSDB18 corpus for music separation, December 2017. URL https://doi.org/10.5281/zenodo.1117372

  46. [46]

    Copyright protection in generative ai: A technical perspective

    Ren, J., Xu, H., He, P., Cui, Y., Zeng, S., Zhang, J., Wen, H., Ding, J., Huang, P., Lyu, L., Liu, H., Chang, Y., and Tang, J. Copyright protection in generative ai: A technical perspective. arXiv preprint 2402.02333, 2024. URL http://arxiv.org/pdf/2402.02333v2

  47. [47]

    Rix, A. W., Beerends, J. G., Hollier, M. P., and Hekstra, A. P. Perceptual evaluation of speech quality (pesq)-a new method for speech quality assessment of telephone networks and codecs. In 2001 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No. 01CH37221), volume 2, pp. 749–752. IEEE, 2001

  48. [48]

    Will umg's lawsuit against believe usher in a new era of regulation for diy distribution?

    Robinson, K. Will umg's lawsuit against believe usher in a new era of regulation for diy distribution? Billboard, November 2024. URL https://www.billboard.com/pro/umg-lawsuit-believe-regulation-diy-distribution/

  49. [49]

    Proactive detection of voice cloning with localized watermarking

    San Roman, R., Fernandez, P., Elsahar, H., Défossez, A., Furon, T., and Tran, T. Proactive detection of voice cloning with localized watermarking. In International Conference on Machine Learning, volume 235, 2024

  50. [50]

    Method for the subjective assessment of intermediate quality level of audio systems

    Series, B. Method for the subjective assessment of intermediate quality level of audio systems. Technical report, International Telecommunication Union Radiocommunication Assembly, 2014

  51. [51]

    Series, B. S. Algorithms to measure audio programme loudness and true-peak audio level. In International Telecommunication Union Radiocommunication Assembly, 2011

  52. [52]

    Shoaib, M. R., Wang, Z., Ahvanooey, M. T., and Zhao, J. Deepfakes, misinformation, and disinformation in the era of frontier ai, generative ai, and large ai models. arXiv preprint 2311.17394, 2023. URL http://arxiv.org/pdf/2311.17394v1

  53. [53]

    Teen girls confront an epidemic of deepfake nudes in schools

    Singer, N. Teen girls confront an epidemic of deepfake nudes in schools. The New York Times, April 2024. URL https://www.nytimes.com/2024/04/08/technology/deepfake-ai-nudes-westfield-high-school.html. Retrieved April 8, 2024

  54. [54]

    Taal, C. H., Hendriks, R. C., Heusdens, R., and Jensen, J. A short-time objective intelligibility measure for time-frequency weighted noisy speech. In 2010 IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 4214–4217. IEEE, 2010

  55. [55]

    Attention is all you need

    Vaswani, A. Attention is all you need. Advances in Neural Information Processing Systems, 2017

  56. [56]

    Ai is spawning a flood of fake trump and harris voices

    Verma, P., Tenjarla, R., and Sand, B. Ai is spawning a flood of fake trump and harris voices. here's how to tell what's real. The Washington Post, October 2024. Retrieved 6:05 a.m

  57. [57]

    Wme announces first major ai partnership with authenticated ai company vermillio to protect artists and create new revenue opportunities

    Vermillio. Wme announces first major ai partnership with authenticated ai company vermillio to protect artists and create new revenue opportunities. Vermillio Blog, January 2024

  58. [58]

    Voxpopuli: A large-scale multilingual speech corpus for representation learning, semi-supervised learning and interpretation

    Wang, C., Riviere, M., Lee, A., Wu, A., Talnikar, C., Haziza, D., Williamson, M., Pino, J., and Dupoux, E. Voxpopuli: A large-scale multilingual speech corpus for representation learning, semi-supervised learning and interpretation. arXiv preprint arXiv:2101.00390, 2021

  59. [59]

    hello, it's me

    Wenger, E., Bronckers, M., Cianfarani, C., Cryan, J., Sha, A., Zheng, H., and Zhao, B. Y. "hello, it's me": Deep learning-based speech synthesis attacks in the real world. arXiv preprint 2109.09598, 2021. URL http://arxiv.org/pdf/2109.09598v1

  60. [60]

    audiowmark: Audio watermarking

    Westerfeld, S. audiowmark: Audio watermarking. https://github.com/swesterfeld/audiowmark, 2020. Accessed: 2025-01-02

  61. [61]

    Yang, P., Ci, H., Song, Y., and Shou, M. Z. Can simple averaging defeat modern watermarks? In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024

  62. [62]

    Youden, W. J. Index for rating diagnostic tests. Cancer, 3(1):32–35, 1950

  63. [63]

    A time-frequency perspective on audio watermarking

    Zhang, H. A time-frequency perspective on audio watermarking. arXiv preprint 2002.03156, 2020. URL http://arxiv.org/pdf/2002.03156v1

  64. [64]

    Robust Audio Watermarking Algorithm Based on Moving Average and DCT

    Zhang, J. and Han, B. Robust audio watermarking algorithm based on moving average and dct. arXiv preprint 1704.02755, 2017. URL http://arxiv.org/pdf/1704.02755v1

  65. [65]

    Zwicker, E. and Fastl, H. Psychoacoustics: Facts and Models. Springer Science & Business Media, Mar 2013
