XAttnMark: Learning Robust Audio Watermarking with Cross-Attention

Andrea Fanelli; Jihui Jin; Lichao Sun; Lie Lu; Yixin Liu

arxiv: 2502.04230 · v3 · pith:KOF4TXN4new · submitted 2025-02-06 · cs.SD · cs.AI · cs.CR · cs.LG · eess.AS

Yixin Liu, Lie Lu, Jihui Jin, Lichao Sun, Andrea Fanelli

Pith reviewed 2026-05-25 08:28 UTC · model grok-4.3

classification cs.SD · cs.AI · cs.CR · cs.LG · eess.AS

keywords audio watermarking · cross-attention · robust detection · attribution · generative audio · psychoacoustic masking · parameter sharing

The pith

XATTNMARK uses cross-attention and partial parameter sharing to jointly optimize audio watermark detection and attribution under generative edits.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces XATTNMARK to address the difficulty of making watermarks that are both easy to detect and easy to trace back to their source in generative audio. Prior methods improve one goal at the expense of the other, but this approach shares some parameters between the generator that embeds the signal and the detector that reads it, then adds a cross-attention step to pull out the embedded message efficiently. A temporal conditioning module spreads the message across time, while a new loss term based on time-frequency masking keeps the watermark from being audible. If the method works as described, it would let audio platforms and rights holders reliably identify and attribute content even after heavy editing or synthesis transformations.

Core claim

XATTNMARK bridges the gap between robust detection and accurate attribution by leveraging partial parameter sharing between the generator and the detector, a cross-attention mechanism for efficient message retrieval, a temporal conditioning module for improved message distribution, and a psychoacoustic-aligned time-frequency masking loss that captures fine-grained auditory masking effects, achieving state-of-the-art performance in both detection and attribution with superior robustness against a wide range of audio transformations including challenging generative editing at varying strengths.
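
As a rough illustration of the masking idea in the claim above, the sketch below weights watermark energy per time-frequency tile so that tiles where the host audio is loud carry a lower penalty. It is not the paper's loss: the weighting formula `1 / (1 + relative host power)`, the frame and hop sizes, and the names `stft_power` and `tf_masking_loss` are all assumptions made for illustration. The sketch also models only same-tile masking, with no spreading across neighboring times or frequencies.

```python
import numpy as np

def stft_power(x, frame=256, hop=128):
    """Power spectrogram of shape (n_frames, frame // 2 + 1), Hann-windowed."""
    n_frames = 1 + (len(x) - frame) // hop
    window = np.hanning(frame)
    frames = np.stack([x[i * hop : i * hop + frame] * window for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1)) ** 2

def tf_masking_loss(host, watermarked, eps=1e-8):
    """Hypothetical per-tile penalty: residual energy costs less where the host is loud."""
    host_power = stft_power(host)
    residual_power = stft_power(watermarked - host)
    # Host power relative to its mean over all tiles (assumed normalization).
    relative = host_power / (host_power.mean() + eps)
    # Loud host tile -> small weight (masked); quiet tile -> weight near 1.
    weight = 1.0 / (1.0 + relative)
    return float((weight * residual_power).mean())

sr = 16000
t = np.arange(sr) / sr
host = np.sin(2 * np.pi * 1000 * t)
# Same residual amplitude, placed under the host tone vs. in an empty band.
masked = host + 0.01 * np.sin(2 * np.pi * 1000 * t + 0.5)
exposed = host + 0.01 * np.sin(2 * np.pi * 3000 * t)
loss_masked = tf_masking_loss(host, masked)
loss_exposed = tf_masking_loss(host, exposed)
print(loss_masked < loss_exposed)
```

In this toy setup, a residual hidden under the 1 kHz host tone is penalized less than an equally strong residual at 3 kHz, where the host has almost no energy.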

What carries the argument

Cross-attention mechanism for message retrieval combined with partial parameter sharing between generator and detector.
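
To make the shared-table idea concrete, here is a minimal NumPy sketch of cross-attention message retrieval under strong simplifying assumptions. The generator and detector share one embedding table, the generator adds the selected bit embeddings to every time frame of a latent, and the detector uses the same table entries as attention queries over the frames. The orthonormal table, the additive broadcast over time, and the names `make_embedding_table`, `embed_message`, and `decode_message` are illustrative choices, not the architecture described in the paper, which uses learned encoder, decoder, and detector networks.

```python
import numpy as np

def make_embedding_table(n_bits, dim, seed=0):
    """Shared table of shape (n_bits, 2, dim): one embedding per bit position and bit value."""
    rng = np.random.default_rng(seed)
    # Orthonormal rows keep the toy decoding unambiguous (assumption; requires 2 * n_bits <= dim).
    q, _ = np.linalg.qr(rng.standard_normal((dim, 2 * n_bits)))
    return q.T.reshape(n_bits, 2, dim)

def embed_message(latent, table, bits):
    """Generator side: add the selected embeddings to every time frame of the latent."""
    chosen = table[np.arange(len(bits)), bits]       # (n_bits, dim)
    return latent + chosen.sum(axis=0)               # broadcast over time

def decode_message(features, table):
    """Detector side: table entries act as queries, detector features as keys and values."""
    n_bits, _, dim = table.shape
    queries = table.reshape(2 * n_bits, dim)
    logits = queries @ features.T / np.sqrt(dim)     # (2 * n_bits, T)
    logits -= logits.max(axis=1, keepdims=True)      # numerically stable softmax
    weights = np.exp(logits)
    weights /= weights.sum(axis=1, keepdims=True)
    attended = weights @ features                    # (2 * n_bits, dim)
    # Score each candidate embedding against what its own query retrieved.
    scores = np.einsum("qd,qd->q", attended, queries).reshape(n_bits, 2)
    return scores.argmax(axis=1)

rng = np.random.default_rng(1)
table = make_embedding_table(n_bits=16, dim=64)
bits = rng.integers(0, 2, size=16)
latent = 0.01 * rng.standard_normal((50, 64))        # stand-in for detector features
decoded = decode_message(embed_message(latent, table, bits), table)
print((decoded == bits).all())
```

Because both sides index the same table, the detector needs no separate per-bit classifier head in this sketch; only the table is shared, which loosely mirrors the "partial" sharing the page describes.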

If this is right

XATTNMARK demonstrates superior robustness against generative editing at varying strengths.
The psychoacoustic-aligned TF masking loss improves watermark imperceptibility.
Partial parameter sharing enables joint optimization of detection and attribution.
The approach advances audio watermarking for protecting intellectual property and ensuring authenticity in generative AI.
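
Attribution, as distinct from detection, can be pictured as matching a decoded message against a pool of known user messages. The sketch below does this by nearest Hamming distance. The matching rule, the 16-bit message length, the repetition-coded user pool, and the name `attribute` are assumptions chosen so the example is easy to verify; the page does not state how the paper performs the lookup.

```python
import numpy as np

def attribute(decoded_bits, user_messages):
    """Return (user index, Hamming distance) of the closest message in the pool."""
    distances = (user_messages != decoded_bits).sum(axis=1)
    best = int(distances.argmin())
    return best, int(distances[best])

# Hypothetical pool: 16 users, each 4-bit id repeated 4 times (minimum distance 4).
ids = np.arange(16)
id_bits = (ids[:, None] >> np.arange(4)) & 1         # (16, 4)
pool = np.repeat(id_bits, 4, axis=1)                 # (16, 16)

noisy = pool[5].copy()
noisy[0] ^= 1                                        # one bit error after editing
print(attribute(noisy, pool))                        # -> (5, 1)
```

A larger pool packs messages closer together, so the same number of bit errors is more likely to land nearer a wrong user, which is one reason attribution is harder than plain detection.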

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the cross-attention design proves stable, it could be adapted to watermark other time-series data such as video or sensor streams.
Deployment at scale would likely require additional checks against removal attacks that target the shared parameters specifically.
Integration into commercial audio generators might create a de-facto standard for provenance tracking without separate post-processing steps.

Load-bearing premise

The cross-attention mechanism together with partial parameter sharing will jointly optimize robust detection and accurate attribution without introducing new failure modes under real-world generative editing pipelines.

What would settle it

A controlled test in which audio is passed through strong generative editing models at multiple strength levels and either watermark detection rate or attribution accuracy falls below the levels reported for WavMark or AudioSeal.
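
The test described above can be organized as a sweep over edit strengths. The harness below is a sketch only: `detect` and `edit` are caller-supplied callables, and the toy versions used here (a keyed correlation detector and additive Gaussian noise) are stand-ins, not a generative editing model or any of the watermarking systems named on this page. The `baseline` argument is a hypothetical reference level to be filled in from reported results.

```python
import numpy as np

def detection_rate_by_strength(detect, edit, clips, strengths):
    """Fraction of clips still detected after editing, for each edit strength."""
    return {s: float(np.mean([detect(edit(clip, s)) for clip in clips])) for s in strengths}

def strengths_below_baseline(rates, baseline):
    """Edit strengths at which the measured rate falls below a reference level."""
    return [s for s, rate in rates.items() if rate < baseline]

# Toy stand-ins (assumptions): spread-spectrum mark, noise as the "edit".
rng = np.random.default_rng(0)
n = 16000
key = rng.choice([-1.0, 1.0], size=n)
clips = [0.3 * rng.standard_normal(n) + 0.1 * key for _ in range(20)]

def toy_edit(clip, strength):
    return clip + strength * rng.standard_normal(len(clip))

def toy_detect(clip):
    return float(np.mean(clip * key)) > 0.05

rates = detection_rate_by_strength(toy_detect, toy_edit, clips, [0.0, 1.0, 5.0, 20.0])
print(rates)
```

The same loop applies to attribution accuracy by swapping `detect` for a function that returns whether the decoded message maps to the correct user.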

Figures

Figures reproduced from arXiv: 2502.04230 by Andrea Fanelli, Jihui Jin, Lichao Sun, Lie Lu, Yixin Liu.

**Figure 1.** Quality-attribution performance trade-off curve across different watermarking strengths and the overall performance comparison on detection and attribution tasks. Higher values on both axes indicate better performance.

**Figure 2.** System diagram for XATTNMARK. XATTNMARK consists of a watermark generator and a watermark detector, with a shared embedding table that facilitates message decoding through a cross-attention module. In the generator part, we first employ an encoder network to encode the audio latent and then apply a temporal modulation to hide the message. The modulated latent is then fed into a decoder to produce the water…

**Figure 3.** Attribution accuracy with different #Users. [Residue from an adjacent plot: x-axis Training Steps (k), y-axis Message Bit Accuracy (%), legend Ours, w/o Modulation, w/o CrossAttn, w/o Both.]

**Figure 5.** MUSHRA subjective listening test results comparing perceptual quality across different watermarking methods. Higher scores indicate better audio quality as rated by human listeners. Our method achieves quality scores comparable to AudioSeal. [Residue from an adjacent plot: x-axis Training Steps, y-axis Accuracy, panel WavMark with curves Detection and Message.]

**Figure 6.** Validation accuracy and quality curve of different methods over training steps. Using a blended architecture, XATTNMARK is able to achieve a better balance in terms of learning efficiency in detection, and message-bit decoding and watermark imperceptibility. Compared to our method, WavMark, the fully-shared architecture, suffers from robustness degradation in detection as the training progresses; the fully…

**Figure 7.** Detection and Attribution Accuracy of our method on augmented samples across different augmentation strengths.

**Figure 8.** The detection and attribution accuracy of XATTNMARK across different audio duration and watermark strength α on MusicCaps. The performance is averaged over all the standard audio editing. The attribution pool size is set to {100, 1000, 10000}.

**Figure 9.** Visualization of the original audio and watermark residuals across four different methods for a randomly selected sample from MusicCaps dataset, showing waveform (left), spectrogram (middle), and mel-spectrogram (right) representations. [Residue from an adjacent plot: x-axis Time Bin, y-axis Frequency Band, panels (a) AudioSeal and (b) Ours.]

**Figure 10.** Visualization of the per-tile masking penalty distribution. The blue regions indicate tiles assigned a lower penalty for embedding a watermark. Our weighting scheme provides more fine-grained control. The audio sample is randomly selected from MusicCaps (Agostinelli et al., 2023), and the watermarked audio is generated from a model at the very beginning of training.

Original abstract

The rapid proliferation of generative audio synthesis and editing technologies has raised serious concerns about copyright infringement, data provenance, and the spread of misinformation via deepfake audio. Watermarking offers a proactive solution by embedding imperceptible yet identifiable and traceable signals into audio content. While recent neural network-based watermarking methods like WavMark and AudioSeal have improved robustness and quality, they struggle to jointly optimize both robust detection and accurate attribution. This paper introduces Cross-Attention Robust Audio Watermark (XATTNMARK), which bridges this gap by leveraging partial parameter sharing between the generator and the detector, a cross-attention mechanism for efficient message retrieval, and a temporal conditioning module for improved message distribution. Additionally, we propose a psychoacoustic-aligned time-frequency (TF) masking loss that captures fine-grained auditory masking effects, improving watermark imperceptibility. XATTNMARK achieves state-of-the-art performance in both detection and attribution, demonstrating superior robustness against a wide range of audio transformations, including challenging generative editing at varying strengths. This work advances audio watermarking for protecting intellectual property and ensuring authenticity in the era of generative AI.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The manuscript introduces XAttnMark, a neural audio watermarking approach that uses partial parameter sharing between the generator and detector, a cross-attention mechanism for message retrieval, a temporal conditioning module, and a psychoacoustic-aligned time-frequency (TF) masking loss. It claims state-of-the-art performance on both detection and attribution tasks, with improved robustness to a range of audio transformations including generative editing at varying strengths, addressing limitations in prior methods such as WavMark and AudioSeal.

Significance. If the empirical claims hold, the work would advance audio watermarking by demonstrating joint optimization of detection and attribution under realistic generative edits, which is a practically relevant gap. The cross-attention design and TF masking loss represent targeted technical contributions that could improve message recoverability and imperceptibility.

major comments (2)

[Abstract] The central SOTA claim for joint detection+attribution robustness is stated without any quantitative results, baselines, datasets, ablations, or error bars, rendering it impossible to verify whether the cross-attention and partial-sharing design actually supports the performance assertions or whether post-hoc choices affect the outcome.
[Abstract] The assumption that cross-attention message retrieval together with partial generator-detector parameter sharing and temporal conditioning will avoid new failure modes under high-strength generative edits (e.g., diffusion-based or voice-conversion pipelines) is untested in the visible text; no ablation isolating the effect of shared parameters on attribution accuracy after such edits is provided, which is load-bearing for the robustness claim.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful and constructive comments. We address each major comment below and indicate where revisions will be made to the manuscript.

Referee: [Abstract] The central SOTA claim for joint detection+attribution robustness is stated without any quantitative results, baselines, datasets, ablations, or error bars, rendering it impossible to verify whether the cross-attention and partial-sharing design actually supports the performance assertions or whether post-hoc choices affect the outcome.

Authors: We agree that the abstract would benefit from including key quantitative results to make the SOTA claim more verifiable on its own. The full manuscript contains the supporting experiments with baselines (WavMark, AudioSeal), datasets, ablations, and error bars in the Experiments section. We will revise the abstract to incorporate specific metrics on detection and attribution robustness. Revision: yes.

Referee: [Abstract] The assumption that cross-attention message retrieval together with partial generator-detector parameter sharing and temporal conditioning will avoid new failure modes under high-strength generative edits (e.g., diffusion-based or voice-conversion pipelines) is untested in the visible text; no ablation isolating the effect of shared parameters on attribution accuracy after such edits is provided, which is load-bearing for the robustness claim.

Authors: The manuscript reports evaluations of robustness under generative editing at varying strengths. We acknowledge, however, that an explicit ablation isolating the contribution of partial parameter sharing to attribution accuracy specifically after high-strength edits is not present. We will add this ablation to strengthen the evidence for the design choices. Revision: yes.

Circularity Check

0 steps flagged

No circularity: empirical ML claims with no derivation chain or self-referential fitting

The provided abstract and context describe an empirical neural audio watermarking model (cross-attention + partial parameter sharing + TF masking loss) whose central claims are SOTA detection/attribution numbers on transformed audio. No equations, first-principles derivations, fitted parameters renamed as predictions, or self-citation chains appear in the text. Performance results are obtained by standard training/evaluation and do not reduce to inputs by construction. This is the normal non-circular case for a learning-based method paper.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only abstract available; no free parameters, axioms, or invented entities are described in sufficient detail to populate the ledger.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

MelShield: Robust Mel-Domain Audio Watermarking for Provenance Attribution of AI Generated Synthesized Speech
cs.SD 2026-05 unverdicted novelty 7.0

MelShield adds keyed low-energy spread-spectrum perturbations to Mel-spectrograms inside TTS pipelines before vocoding to enable robust extraction of user-specific attribution signals even after compression or noise.
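
The one-line summary above describes a keyed spread-spectrum scheme; a generic version of that idea can be sketched as follows. This is textbook spread-spectrum marking applied to a random stand-in for a log-Mel spectrogram, not MelShield's actual method: the ±1 pattern derived from an integer key, the strength `alpha`, and the names `keyed_pattern`, `embed`, and `correlate` are all assumptions, and no vocoder, compression, or noise stage is modeled.

```python
import numpy as np

def keyed_pattern(key, shape):
    """Pseudo-random +/-1 pattern reproducible from an integer key."""
    return np.random.default_rng(key).choice([-1.0, 1.0], size=shape)

def embed(log_mel, key, alpha=0.1):
    """Add a low-energy keyed perturbation to the (stand-in) Mel representation."""
    return log_mel + alpha * keyed_pattern(key, log_mel.shape)

def correlate(log_mel, key):
    """Mean correlation with the keyed pattern; near alpha if marked with this key."""
    return float(np.mean(log_mel * keyed_pattern(key, log_mel.shape)))

log_mel = np.random.default_rng(0).standard_normal((80, 400))  # stand-in features
marked = embed(log_mel, key=1234)
right = correlate(marked, key=1234)
wrong = correlate(marked, key=4321)
print(right > 0.05 > wrong)
```

Only the holder of the right key sees a correlation near `alpha`; any other key yields a value near zero, which is what makes the mark attributable to a specific user.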

Reference graph

Works this paper leans on

68 extracted references · 68 canonical work pages · cited by 1 Pith paper · 7 internal anchors

[3] Abbott, R. and Rothman, E. Disrupting creativity: Copyright law in the age of generative artificial intelligence. Fla. L. Rev., 75:1141, 2023.
[4] Agnew, W., Barnett, J., Chu, A., Hong, R., Feffer, M., Netzorg, R., Jiang, H. H., Awumey, E., and Das, S. Sound check: Auditing audio datasets. arXiv preprint arXiv:2410.13114, 2024. URL http://arxiv.org/pdf/2410.13114v1
[5] Agostinelli, A., Denk, T. I., Borsos, Z., Engel, J., Verzetti, M., Caillon, A., Huang, Q., Jansen, A., Roberts, A., Tagliasacchi, M., et al. MusicLM: Generating music from text. arXiv preprint arXiv:2301.11325, 2023.
[6] Ardila, R., Branson, M., Davis, K., Henretty, M., Kohler, M., Meyer, J., Morais, R., Saunders, L., Tyers, F. M., and Weber, G. Common Voice: A massively-multilingual speech corpus. In LREC, 2020.
[7] Barnett, J., Garcia, H. F., and Pardo, B. Exploring musical roots: Applying audio embeddings to empower influence attribution for a generative music model. arXiv preprint arXiv:2401.14542, 2024. URL http://arxiv.org/pdf/2401.14542v1
[8] Bilika, D., Michopoulou, N., Alepis, E., and Patsakis, C. Hello me, meet the real me: Audio deepfake attacks on voice assistants. arXiv preprint arXiv:2302.10328, 2023. URL http://arxiv.org/pdf/2302.10328v1
[9] Bosi, M., Brandenburg, K., Quackenbush, S., Fielder, L., Akagiri, K., Fuchs, H., and Dietz, M. ISO/IEC MPEG-2 advanced audio coding. Journal of the Audio Engineering Society, 45(10):789-814, 1997.
[10] Brigham, N. G., Wei, M., Kohno, T., and Redmiles, E. M. "Violation of my body:" Perceptions of AI-generated non-consensual (intimate) imagery. arXiv preprint arXiv:2406.05520, 2024. URL http://arxiv.org/pdf/2406.05520v2
[11] Buo, S. A. The emerging threats of deepfake attacks and countermeasures. arXiv preprint arXiv:2012.07989, 2020. URL http://arxiv.org/pdf/2012.07989v1
[12] Chen, G., Wu, Y., Liu, S., Liu, T., Du, X., and Wei, F. WavMark: Watermarking for audio generation. arXiv preprint arXiv:2308.12770, 2023.
[13] Chen, J., Jordan, M. I., and Wainwright, M. J. HopSkipJumpAttack: A query-efficient decision-based attack. In S&P, 2020.
[14] Cho, W. Artists score major win in copyright case against AI art generators. The Hollywood Reporter, August 2024.
[15] Clevert, D.-A. Fast and accurate deep network learning by exponential linear units (ELUs). arXiv preprint arXiv:1511.07289, 2015.
[16] Copet, J., Kreuk, F., Gat, I., Remez, T., Kant, D., Synnaeve, G., Adi, Y., and Défossez, A. Simple and controllable music generation. In Thirty-seventh Conference on Neural Information Processing Systems, 2023.
[17] Copet, J., Kreuk, F., Gat, I., Remez, T., Kant, D., Synnaeve, G., Adi, Y., and Défossez, A. Simple and controllable music generation. Advances in Neural Information Processing Systems, 36, 2024.
[18] Defferrard, M., Benzi, K., Vandergheynst, P., and Bresson, X. FMA: A dataset for music analysis. arXiv preprint arXiv:1612.01840, 2016.
[19] Défossez, A., Copet, J., Synnaeve, G., and Adi, Y. High fidelity neural audio compression. arXiv preprint arXiv:2210.13438, 2022.
[20] Desai, D. R. and Riedl, M. Between copyright and computer science: The law and ethics of generative AI. arXiv preprint arXiv:2403.14653, 2024. URL http://arxiv.org/pdf/2403.14653v2
[21] Du, L., Zhou, X., Chen, M., Zhang, C., Su, Z., Cheng, P., Chen, J., and Zhang, Z. SoK: Dataset copyright auditing in machine learning systems. arXiv preprint arXiv:2410.16618, 2024. URL http://arxiv.org/pdf/2410.16618v1
[22] Evans, Z., Parker, J. D., Carr, C. J., Zukowski, Z., Taylor, J., and Pons, J. Stable Audio Open. arXiv preprint arXiv:2407.14358, 2024.
[23] Gelfand, S. A. Hearing: An introduction to psychological and physiological acoustics. CRC Press, 2017.
[24] Gemmeke, J. F., Ellis, D. P., Freedman, D., Jansen, A., Lawrence, W., Moore, R. C., Plakal, M., and Ritter, M. Audio Set: An ontology and human-labeled dataset for audio events. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 776-780. IEEE, 2017.
[25] Hines, A., Skoglund, J., Kokaram, A., and Harte, N. ViSQOL: The virtual speech quality objective listener. In IWAENC 2012; International Workshop on Acoustic Signal Enhancement, pp. 1-4. VDE, 2012.
[26] Holdsworth, J., Nimmo-Smith, I., Patterson, R., and Rice, P. Implementing a gammatone filter bank. Annex C of the SVOS Final Report: Part A: The Auditory Filterbank, 1:1-5, 1988.
[27] Hu, Y., Ma, M., Lu, W., Xiong, N. N., and Wei, J. Selection of the optimal embedding positions of digital audio watermarking in wavelet domain. arXiv preprint arXiv:2010.11461, 2020. URL http://arxiv.org/pdf/2010.11461v1
[28] Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[29] Li, C., Wang, R., Liu, L., Du, J., Sun, Y., Guo, Z., Zhang, Z., and Jiang, Y. Quality-aware masked diffusion transformer for enhanced music generation. arXiv preprint arXiv:2405.15863, 2024.
[30] Liu, C., Zhang, J., Zhang, T., Yang, X., Zhang, W., and Yu, N. Detecting voice cloning attacks via timbre watermarking. arXiv preprint arXiv:2312.03410, 2023a. URL http://arxiv.org/pdf/2312.03410v1
[31] Liu, H., Guo, M., Jiang, Z., Wang, L., and Gong, N. Z. AudioMarkBench: Benchmarking robustness of audio watermarking. arXiv preprint arXiv:2406.06979, 2024a. URL http://arxiv.org/pdf/2406.06979v1
[32] Liu, H., Yuan, Y., Liu, X., Mei, X., Kong, Q., Tian, Q., Wang, Y., Wang, W., Wang, Y., and Plumbley, M. D. AudioLDM 2: Learning holistic audio generation with self-supervised pretraining. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2024b.
[33] Liu, X., Wang, X., Sahidullah, M., Patino, J., Delgado, H., Kinnunen, T., Todisco, M., Yamagishi, J., Evans, N., Nautsch, A., et al. ASVspoof 2021: Towards spoofed and deepfake speech detection in the wild. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2023b.
[34] Liu, Y., Fan, C., Dai, Y., Chen, X., Zhou, P., and Sun, L. MetaCloak: Preventing unauthorized subject-driven text-to-image diffusion-based synthesis via meta-learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 24219-24228, June 2024c.
[35] Manor, H. and Michaeli, T. Zero-shot unsupervised and text-based audio editing using DDPM inversion. arXiv preprint arXiv:2402.10009, 2024. URL http://arxiv.org/pdf/2402.10009v4
[36] Necciari, T., Laback, B., Savel, S., Ystad, S., Balazs, P., Meunier, S., and Kronland-Martinet, R. Auditory time-frequency masking for spectrally and temporally maximally-compact stimuli. PLoS ONE, 11(11):e0166937, November 2016.
[37] U.S. Copyright Office. Copyright and artificial intelligence: Part 1 – digital replicas report, 2023. URL https://www.copyright.gov/ai/Copyright-and-Artificial-Intelligence-Part-1-Digital-Replicas-Report.pdf. Accessed: 2025-01-23.
[38] OpenAI. Sora: Creating video from text. https://openai.com/sora, 2024.
[39] Pan, Y., Pan, L., Chen, W., Nakov, P., Kan, M.-Y., and Wang, W. Y. On the risk of misinformation pollution with large language models. arXiv preprint arXiv:2305.13661, 2023. URL http://arxiv.org/pdf/2305.13661v2
[40] Panayotov, V., Chen, G., Povey, D., and Khudanpur, S. LibriSpeech: an ASR corpus based on public domain audio books. In ICASSP, 2015.
[41] Park, P. S., Goldstein, S., O'Gara, A., Chen, M., and Hendrycks, D. AI deception: A survey of examples, risks, and potential solutions. arXiv preprint arXiv:2308.14752, 2023. URL http://arxiv.org/pdf/2308.14752v1
[42] Peebles, W. and Xie, S. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4195-4205, 2023.
[43] Qin, J., Lyu, S., Deng, J., Liang, X., Xiang, S., and Chen, H. A lattice-based embedding method for reversible audio watermarking. arXiv preprint arXiv:2209.07066, 2022. URL http://arxiv.org/pdf/2209.07066v1
[44] Qiwei, L., Zhang, S., Kasper, A. T., Ashkinaze, J., Eaton, A. A., Schoenebeck, S., and Gilbert, E. Reporting non-consensual intimate media: An audit study of deepfakes. arXiv preprint arXiv:2409.12138, 2024. URL http://arxiv.org/pdf/2409.12138v1
[45] Rafii, Z., Liutkus, A., Stöter, F.-R., Mimilakis, S. I., and Bittner, R. The MUSDB18 corpus for music separation, December 2017. URL https://doi.org/10.5281/zenodo.1117372
[46] Ren, J., Xu, H., He, P., Cui, Y., Zeng, S., Zhang, J., Wen, H., Ding, J., Huang, P., Lyu, L., Liu, H., Chang, Y., and Tang, J. Copyright protection in generative AI: A technical perspective. arXiv preprint arXiv:2402.02333, 2024. URL http://arxiv.org/pdf/2402.02333v2
[47] Rix, A. W., Beerends, J. G., Hollier, M. P., and Hekstra, A. P. Perceptual evaluation of speech quality (PESQ), a new method for speech quality assessment of telephone networks and codecs. In 2001 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No. 01CH37221), volume 2, pp. 749-752. IEEE, 2001.
[48] Robinson, K. Will UMG's lawsuit against Believe usher in a new era of regulation for DIY distribution? Billboard, November 2024. URL https://www.billboard.com/pro/umg-lawsuit-believe-regulation-diy-distribution/
[49] San Roman, R., Fernandez, P., Elsahar, H., Défossez, A., Furon, T., and Tran, T. Proactive detection of voice cloning with localized watermarking. In International Conference on Machine Learning, volume 235, 2024.
[50] Series, B. Method for the subjective assessment of intermediate quality level of audio systems. Technical report, International Telecommunication Union Radiocommunication Assembly, 2014.
[51] Series, B. S. Algorithms to measure audio programme loudness and true-peak audio level. In International Telecommunication Union Radiocommunication Assembly, 2011.
[52] Shoaib, M. R., Wang, Z., Ahvanooey, M. T., and Zhao, J. Deepfakes, misinformation, and disinformation in the era of frontier AI, generative AI, and large AI models. arXiv preprint arXiv:2311.17394, 2023. URL http://arxiv.org/pdf/2311.17394v1
[53] Singer, N. Teen girls confront an epidemic of deepfake nudes in schools. The New York Times, April 2024. URL https://www.nytimes.com/2024/04/08/technology/deepfake-ai-nudes-westfield-high-school.html. Retrieved April 8, 2024.
[54] Taal, C. H., Hendriks, R. C., Heusdens, R., and Jensen, J. A short-time objective intelligibility measure for time-frequency weighted noisy speech. In 2010 IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 4214-4217. IEEE, 2010.
[55] Vaswani, A. Attention is all you need. Advances in Neural Information Processing Systems, 2017.
[56] Verma, P., Tenjarla, R., and Sand, B. AI is spawning a flood of fake Trump and Harris voices. Here's how to tell what's real. The Washington Post, October 2024. Retrieved 6:05 a.m
[57] Vermillio. WME announces first major AI partnership with authenticated AI company Vermillio to protect artists and create new revenue opportunities. Vermillio Blog, January 2024.
[58] Wang, C., Riviere, M., Lee, A., Wu, A., Talnikar, C., Haziza, D., Williamson, M., Pino, J., and Dupoux, E. VoxPopuli: A large-scale multilingual speech corpus for representation learning, semi-supervised learning and interpretation. arXiv preprint arXiv:2101.00390, 2021.
[59] Wenger, E., Bronckers, M., Cianfarani, C., Cryan, J., Sha, A., Zheng, H., and Zhao, B. Y. "Hello, it's me": Deep learning-based speech synthesis attacks in the real world. arXiv preprint arXiv:2109.09598, 2021. URL http://arxiv.org/pdf/2109.09598v1
[60] Westerfeld, S. audiowmark: Audio watermarking. https://github.com/swesterfeld/audiowmark, 2020. Accessed: 2025-01-02.
[61] Yang, P., Ci, H., Song, Y., and Shou, M. Z. Can simple averaging defeat modern watermarks? In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024.
[62] Youden, W. J. Index for rating diagnostic tests. Cancer, 3(1):32-35, 1950.
[63] Zhang, H. A time-frequency perspective on audio watermarking. arXiv preprint arXiv:2002.03156, 2020. URL http://arxiv.org/pdf/2002.03156v1
[64] Zhang, J. and Han, B. Robust audio watermarking algorithm based on moving average and DCT. arXiv preprint arXiv:1704.02755, 2017. URL http://arxiv.org/pdf/1704.02755v1
[65] Zwicker, E. and Fastl, H. Psychoacoustics: Facts and Models. Springer Science & Business Media, March 2013.
[1] [1]

write newline

" write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION format.date year duplicate empty "emp...

work page

[2] [2]

write newline

" write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION format.date year duplicate empty "emp...

work page

[3] [3]

and Rothman, E

Abbott, R. and Rothman, E. Disrupting creativity: Copyright law in the age of generative artificial intelligence. Fla. L. Rev., 75: 0 1141, 2023

work page 2023

[4] [4]

H., Awumey, E., and Das, S

Agnew, W., Barnett, J., Chu, A., Hong, R., Feffer, M., Netzorg, R., Jiang, H. H., Awumey, E., and Das, S. Sound check: Auditing audio datasets. arXiv preprint 2410.13114, 2024. URL http://arxiv.org/pdf/2410.13114v1

work page arXiv 2024

[5] [5]

MusicLM: Generating Music From Text

Agostinelli, A., Denk, T. I., Borsos, Z., Engel, J., Verzetti, M., Caillon, A., Huang, Q., Jansen, A., Roberts, A., Tagliasacchi, M., et al. Musiclm: Generating music from text. arXiv preprint arXiv:2301.11325, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[6] [6]

M., and Weber, G

Ardila, R., Branson, M., Davis, K., Henretty, M., Kohler, M., Meyer, J., Morais, R., Saunders, L., Tyers, F. M., and Weber, G. Common voice: A massively-multilingual speech corpus. In LREC, 2020

work page 2020

[7] [7]

F., and Pardo, B

Barnett, J., Garcia, H. F., and Pardo, B. Exploring musical roots: Applying audio embeddings to empower influence attribution for a generative music model. arXiv preprint 2401.14542, 2024. URL http://arxiv.org/pdf/2401.14542v1

work page arXiv 2024

[8] [8]

Hello me, meet the real me: Audio deepfake attacks on voice assistants

Bilika, D., Michopoulou, N., Alepis, E., and Patsakis, C. Hello me, meet the real me: Audio deepfake attacks on voice assistants. arXiv preprint 2302.10328, 2023. URL http://arxiv.org/pdf/2302.10328v1

work page arXiv 2023

[9] [9]

Iso/iec mpeg-2 advanced audio coding

Bosi, M., Brandenburg, K., Quackenbush, S., Fielder, L., Akagiri, K., Fuchs, H., and Dietz, M. Iso/iec mpeg-2 advanced audio coding. Journal of the Audio engineering society, 45 0 (10): 0 789--814, 1997

work page 1997

[10] [10]

violation of my body:

Brigham, N. G., Wei, M., Kohno, T., and Redmiles, E. M. "violation of my body:" perceptions of ai-generated non-consensual (intimate) imagery. arXiv preprint 2406.05520, 2024. URL http://arxiv.org/pdf/2406.05520v2

work page arXiv 2024

[11] [11]

Buo, S. A. The emerging threats of deepfake attacks and countermeasures. arXiv preprint 2012.07989, 2020. URL http://arxiv.org/pdf/2012.07989v1

work page arXiv 2012

[12] [12]

Wavmark: Watermarking for audio generation

Chen, G., Wu, Y., Liu, S., Liu, T., Du, X., and Wei, F. Wavmark: Watermarking for audio generation. arXiv preprint arXiv:2308.12770, 2023

work page arXiv 2023

[13] [13]

I., and Wainwright, M

Chen, J., Jordan, M. I., and Wainwright, M. J. Hopskipjumpattack: A query-efficient decision-based attack. In S&P, 2020

work page 2020

[14] [14]

Artists score major win in copyright case against ai art generators

Cho, W. Artists score major win in copyright case against ai art generators. The Hollywood Reporter, August 2024

work page 2024

[15] [15]

Fast and Accurate Deep Network Learning by Exponential Linear Units (ELUs)

Clevert, D.-A. Fast and accurate deep network learning by exponential linear units (elus). arXiv preprint arXiv:1511.07289, 2015

work page internal anchor Pith review Pith/arXiv arXiv 2015

[16] [16]

Simple and controllable music generation

Copet, J., Kreuk, F., Gat, I., Remez, T., Kant, D., Synnaeve, G., Adi, Y., and Défossez, A. Simple and controllable music generation. In Thirty-seventh Conference on Neural Information Processing Systems, 2023

work page 2023

[17] [17]

Simple and controllable music generation

Copet, J., Kreuk, F., Gat, I., Remez, T., Kant, D., Synnaeve, G., Adi, Y., and D \'e fossez, A. Simple and controllable music generation. Advances in Neural Information Processing Systems, 36, 2024

work page 2024

[18] [18]

FMA: A Dataset For Music Analysis

Defferrard, M., Benzi, K., Vandergheynst, P., and Bresson, X. Fma: A dataset for music analysis. arXiv preprint arXiv:1612.01840, 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016

[19] [19]

High Fidelity Neural Audio Compression

Defossez, A., Copet, J., Synnaeve, G., and Adi, Y. High fidelity neural audio compression. arXiv preprint arXiv:2210.13438, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022

[20] [20]

Desai, D. R. and Riedl, M. Between copyright and computer science: The law and ethics of generative ai. arXiv preprint 2403.14653, 2024. URL http://arxiv.org/pdf/2403.14653v2

work page arXiv 2024

[21] [21]

Sok: Dataset copyright auditing in machine learning systems

Du, L., Zhou, X., Chen, M., Zhang, C., Su, Z., Cheng, P., Chen, J., and Zhang, Z. Sok: Dataset copyright auditing in machine learning systems. arXiv preprint 2410.16618, 2024. URL http://arxiv.org/pdf/2410.16618v1

work page arXiv 2024

[22] [22]

D., Carr, C

Evans, Z., Parker, J. D., Carr, C. J., Zukowski, Z., Taylor, J., and Pons, J. Stable audio open. arXiv preprint arXiv:2407.14358, 2024

work page arXiv 2024

[23] [23]

Gelfand, S. A. Hearing: An introduction to psychological and physiological acoustics. CRC Press, 2017

work page 2017

[24] [24]

F., Ellis, D

Gemmeke, J. F., Ellis, D. P., Freedman, D., Jansen, A., Lawrence, W., Moore, R. C., Plakal, M., and Ritter, M. Audio set: An ontology and human-labeled dataset for audio events. In 2017 IEEE international conference on acoustics, speech and signal processing (ICASSP), pp.\ 776--780. IEEE, 2017

work page 2017

[25] [25]

Visqol: The virtual speech quality objective listener

Hines, A., Skoglund, J., Kokaram, A., and Harte, N. Visqol: The virtual speech quality objective listener. In IWAENC 2012; International Workshop on Acoustic Signal Enhancement, pp.\ 1--4. VDE, 2012

work page 2012

[26] [26]

Implementing a gammatone filter bank

Holdsworth, J., Nimmo-Smith, I., Patterson, R., and Rice, P. Implementing a gammatone filter bank. Annex C of the SVOS Final Report: Part A: The Auditory Filterbank, 1: 0 1--5, 1988

work page 1988

[27] [27]

N., and Wei, J

Hu, Y., Ma, M., Lu, W., Xiong, N. N., and Wei, J. Selection of the optimal embedding positions of digital audio watermarking in wavelet domain. arXiv preprint 2010.11461, 2020. URL http://arxiv.org/pdf/2010.11461v1

work page arXiv 2010

[28] [28]

Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint:1412.6980, 2014

work page internal anchor Pith review Pith/arXiv arXiv 2014

[29] [29]

Quality-aware masked diffusion transformer for enhanced music generation

Li, C., Wang, R., Liu, L., Du, J., Sun, Y., Guo, Z., Zhang, Z., and Jiang, Y. Quality-aware masked diffusion transformer for enhanced music generation. arXiv preprint arXiv:2405.15863, 2024

work page arXiv 2024

[30] [30]

Detecting voice cloning attacks via timbre watermarking

Liu, C., Zhang, J., Zhang, T., Yang, X., Zhang, W., and Yu, N. Detecting voice cloning attacks via timbre watermarking. arXiv preprint 2312.03410, 2023 a . URL http://arxiv.org/pdf/2312.03410v1

work page arXiv 2023

[31] [31]

Liu, H., Guo, M., Jiang, Z., Wang, L., and Gong, N. Z. Audiomarkbench: Benchmarking robustness of audio watermarking. arXiv preprint 2406.06979, 2024 a . URL http://arxiv.org/pdf/2406.06979v1

work page arXiv 2024

[32] [32]

Liu, H., Yuan, Y., Liu, X., Mei, X., Kong, Q., Tian, Q., Wang, Y., Wang, W., Wang, Y., and Plumbley, M. D. Audioldm 2: Learning holistic audio generation with self-supervised pretraining. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2024 b

work page 2024

[33] [33]

Asvspoof 2021: Towards spoofed and deepfake speech detection in the wild

Liu, X., Wang, X., Sahidullah, M., Patino, J., Delgado, H., Kinnunen, T., Todisco, M., Yamagishi, J., Evans, N., Nautsch, A., et al. Asvspoof 2021: Towards spoofed and deepfake speech detection in the wild. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2023 b

work page 2021

[34] [34]

Metacloak: Preventing unauthorized subject-driven text-to-image diffusion-based synthesis via meta-learning

Liu, Y., Fan, C., Dai, Y., Chen, X., Zhou, P., and Sun, L. Metacloak: Preventing unauthorized subject-driven text-to-image diffusion-based synthesis via meta-learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp.\ 24219--24228, June 2024 c

work page 2024

[35] [35]

and Michaeli, T

Manor, H. and Michaeli, T. Zero-shot unsupervised and text-based audio editing using ddpm inversion. arXiv preprint 2402.10009, 2024. URL http://arxiv.org/pdf/2402.10009v4

work page arXiv 2024

[36] [36]

Auditory time-frequency masking for spectrally and temporally maximally-compact stimuli

Necciari, T., Laback, B., Savel, S., Ystad, S., Balazs, P., Meunier, S., and Kronland-Martinet, R. Auditory time-frequency masking for spectrally and temporally maximally-compact stimuli. PloS one, 11 0 (11): 0 e0166937, nov 2016

work page 2016

[37] [37]

Office, U. C. Copyright and artificial intelligence: Part 1 – digital replicas report, 2023. URL https://www.copyright.gov/ai/Copyright-and-Artificial-Intelligence-Part-1-Digital-Replicas-Report.pdf. Accessed: 2025-01-23

work page 2023

[38] [38]

Sora: Creating video from text

OpenAI. Sora: Creating video from text. https://openai.com/sora, 2024

work page 2024

[39] [39]

Pan, Y., Pan, L., Chen, W., Nakov, P., Kan, M.-Y., and Wang, W. Y. On the risk of misinformation pollution with large language models. arXiv preprint 2305.13661, 2023. URL http://arxiv.org/pdf/2305.13661v2

work page arXiv 2023

[40] [40]

Librispeech: an asr corpus based on public domain audio books

Panayotov, V., Chen, G., Povey, D., and Khudanpur, S. Librispeech: an asr corpus based on public domain audio books. In ICASSP, 2015

work page 2015

[41] [41]

S., Goldstein, S., O'Gara, A., Chen, M., and Hendrycks, D

Park, P. S., Goldstein, S., O'Gara, A., Chen, M., and Hendrycks, D. Ai deception: A survey of examples, risks, and potential solutions. arXiv preprint 2308.14752, 2023. URL http://arxiv.org/pdf/2308.14752v1

work page arXiv 2023

[42] [42]

and Xie, S

Peebles, W. and Xie, S. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp.\ 4195--4205, 2023

work page 2023

[43] [43]

A lattice-based embedding method for reversible audio watermarking

Qin, J., Lyu, S., Deng, J., Liang, X., Xiang, S., and Chen, H. A lattice-based embedding method for reversible audio watermarking. arXiv preprint 2209.07066, 2022. URL http://arxiv.org/pdf/2209.07066v1

work page arXiv 2022

[44] [44]

T., Ashkinaze, J., Eaton, A

Qiwei, L., Zhang, S., Kasper, A. T., Ashkinaze, J., Eaton, A. A., Schoenebeck, S., and Gilbert, E. Reporting non-consensual intimate media: An audit study of deepfakes. arXiv preprint 2409.12138, 2024. URL http://arxiv.org/pdf/2409.12138v1

work page arXiv 2024

[45] [45]

I., and Bittner, R

Rafii, Z., Liutkus, A., St \"o ter, F.-R., Mimilakis, S. I., and Bittner, R. The MUSDB18 corpus for music separation, December 2017. URL https://doi.org/10.5281/zenodo.1117372

work page doi:10.5281/zenodo.1117372 2017

[46] [46]

Copyright protection in generative ai: A technical perspective

Ren, J., Xu, H., He, P., Cui, Y., Zeng, S., Zhang, J., Wen, H., Ding, J., Huang, P., Lyu, L., Liu, H., Chang, Y., and Tang, J. Copyright protection in generative ai: A technical perspective. arXiv preprint 2402.02333, 2024. URL http://arxiv.org/pdf/2402.02333v2

work page arXiv 2024

[47] [47]

W., Beerends, J

Rix, A. W., Beerends, J. G., Hollier, M. P., and Hekstra, A. P. Perceptual evaluation of speech quality (pesq)-a new method for speech quality assessment of telephone networks and codecs. In 2001 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No. 01CH37221), volume 2, pp.\ 749--752. IEEE, 2001

work page 2001

[48] [48]

Will umg's lawsuit against believe usher in a new era of regulation for diy distribution? Billboard, November 2024

Robinson, K. Will umg's lawsuit against believe usher in a new era of regulation for diy distribution? Billboard, November 2024. URL https://www.billboard.com/pro/umg-lawsuit-believe-regulation-diy-distribution/

work page 2024

[49] [49]

Proactive detection of voice cloning with localized watermarking

San Roman, R., Fernandez, P., Elsahar, H., D \'e fossez, A., Furon, T., and Tran, T. Proactive detection of voice cloning with localized watermarking. In International Conference on Machine Learning, volume 235, 2024

work page 2024

[50] [50]

Method for the subjective assessment of intermediate quality level of audio systems

Series, B. Method for the subjective assessment of intermediate quality level of audio systems. Technical report, International Telecommunication Union Radiocommunication Assembly, 2014

work page 2014

[51] [51]

Series, B. S. Algorithms to measure audio programme loudness and true-peak audio level. In International Telecommunication Union Radiocommunication Assembly, 2011

work page 2011

[52] [52]

R., Wang, Z., Ahvanooey, M

Shoaib, M. R., Wang, Z., Ahvanooey, M. T., and Zhao, J. Deepfakes, misinformation, and disinformation in the era of frontier ai, generative ai, and large ai models. arXiv preprint 2311.17394, 2023. URL http://arxiv.org/pdf/2311.17394v1

work page arXiv 2023

[53] [53]

Teen girls confront an epidemic of deepfake nudes in schools

Singer, N. Teen girls confront an epidemic of deepfake nudes in schools. The New York Times, April 2024. URL https://www.nytimes.com/2024/04/08/technology/deepfake-ai-nudes-westfield-high-school.html. Retrieved April 8, 2024

work page 2024

[54] [54]

H., Hendriks, R

Taal, C. H., Hendriks, R. C., Heusdens, R., and Jensen, J. A short-time objective intelligibility measure for time-frequency weighted noisy speech. In 2010 IEEE International Conference on Acoustics, Speech and Signal Processing, pp.\ 4214--4217. IEEE, 2010

work page 2010

[55] [55]

Attention is all you need

Vaswani, A. Attention is all you need. Advances in Neural Information Processing Systems, 2017

work page 2017

[56] [56]

Ai is spawning a flood of fake trump and harris voices

Verma, P., Tenjarla, R., and Sand, B. Ai is spawning a flood of fake trump and harris voices. here's how to tell what's real. The Washington Post, October 2024. Retrieved 6:05 a.m

work page 2024

[57] [57]

Wme announces first major ai partnership with authenticated ai company vermillio to protect artists and create new revenue opportunities

Vermillio. Wme announces first major ai partnership with authenticated ai company vermillio to protect artists and create new revenue opportunities. Vermillio Blog, January 2024

work page 2024

[58] [58]

Voxpopuli: A large-scale multilingual speech corpus for representation learning, semi-supervised learning and interpretation

Wang, C., Riviere, M., Lee, A., Wu, A., Talnikar, C., Haziza, D., Williamson, M., Pino, J., and Dupoux, E. Voxpopuli: A large-scale multilingual speech corpus for representation learning, semi-supervised learning and interpretation. arXiv preprint arXiv:2101.00390, 2021

work page arXiv 2021

[59] [59]

hello, it's me

Wenger, E., Bronckers, M., Cianfarani, C., Cryan, J., Sha, A., Zheng, H., and Zhao, B. Y. "hello, it's me": Deep learning-based speech synthesis attacks in the real world. arXiv preprint 2109.09598, 2021. URL http://arxiv.org/pdf/2109.09598v1

work page arXiv 2021

[60] [60]

audiowmark: Audio watermarking

Westerfeld, S. audiowmark: Audio watermarking. https://github.com/swesterfeld/audiowmark, 2020. Accessed: 2025-01-02

work page 2020

[61] [61]

Yang, P., Ci, H., Song, Y., and Shou, M. Z. Can simple averaging defeat modern watermarks? In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024

work page 2024

[62] [62]

Youden, W. J. Index for rating diagnostic tests. Cancer, 3 0 (1): 0 32--35, 1950

work page 1950

[63] [63]

A time-frequency perspective on audio watermarking

Zhang, H. A time-frequency perspective on audio watermarking. arXiv preprint 2002.03156, 2020. URL http://arxiv.org/pdf/2002.03156v1

work page arXiv 2002

[64] [64]

Robust Audio Watermarking Algorithm Based on Moving Average and DCT

Zhang, J. and Han, B. Robust audio watermarking algorithm based on moving average and dct. arXiv preprint 1704.02755, 2017. URL http://arxiv.org/pdf/1704.02755v1

work page internal anchor Pith review Pith/arXiv arXiv 2017

[65] [65]

and Fastl, H

Zwicker, E. and Fastl, H. Psychoacoustics: Facts and Models. Springer Science & Business Media, mar 2013

work page 2013

[66] [66]

@esa (Ref

\@ifxundefined[1] #1\@undefined \@firstoftwo \@secondoftwo \@ifnum[1] #1 \@firstoftwo \@secondoftwo \@ifx[1] #1 \@firstoftwo \@secondoftwo [2] @ #1 \@temptokena #2 #1 @ \@temptokena \@ifclassloaded agu2001 natbib The agu2001 class already includes natbib coding, so you should not add it explicitly Type <Return> for now, but then later remove the command n...

work page

[67] [67]

\@lbibitem[] @bibitem@first@sw\@secondoftwo \@lbibitem[#1]#2 \@extra@b@citeb \@ifundefined br@#2\@extra@b@citeb \@namedef br@#2 \@nameuse br@#2\@extra@b@citeb \@ifundefined b@#2\@extra@b@citeb @num @parse #2 @tmp #1 NAT@b@open@#2 NAT@b@shut@#2 \@ifnum @merge>\@ne @bibitem@first@sw \@firstoftwo \@ifundefined NAT@b*@#2 \@firstoftwo @num @NAT@ctr \@secondoft...

work page

[68] [68]

Adding Conditional Control to Text-to-Image Diffusion Models

@open @close @open @close and [1] URL: #1 \@ifundefined chapter * \@mkboth \@ifxundefined @sectionbib * \@mkboth * \@mkboth\@gobbletwo \@ifclassloaded amsart * \@ifclassloaded amsbook * \@ifxundefined @heading @heading NAT@ctr thebibliography [1] @ \@biblabel @NAT@ctr \@bibsetup #1 @NAT@ctr @ @openbib .11em \@plus.33em \@minus.07em 4000 4000 `\.\@m @bibit...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.1145/3240323.3241729 2023