MelShield: Robust Mel-Domain Audio Watermarking for Provenance Attribution of AI-Generated Synthesized Speech
Pith reviewed 2026-05-10 16:01 UTC · model grok-4.3
The pith
MelShield embeds keyed binary payloads as low-energy perturbations in the Mel-spectrogram before vocoder synthesis to enable reliable attribution of AI-generated speech.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
MelShield treats the intermediate Mel-spectrogram as the host signal and embeds a short binary payload via low-energy, keyed spread-spectrum perturbations distributed across carefully selected time-frequency regions prior to waveform synthesis, enabling plug-and-play watermarking for Mel-conditioned TTS architectures without requiring changes to the vocoder.
What carries the argument
Keyed spread-spectrum perturbation embedding performed on the Mel-spectrogram, distributing the binary payload across chosen time-frequency bins so the marks remain extractable after synthesis and distortion.
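As a concrete sketch of that mechanism (our construction, not the paper's exact scheme: the function names, the perturbation strength `alpha`, full-spectrogram spreading, and the non-blind detector that subtracts a reference Mel are all assumptions made for brevity), keyed spread-spectrum embedding with correlation-based extraction can look like:

```python
import numpy as np

def embed(mel, payload_bits, key, alpha=0.02):
    # One keyed pseudo-random +/-1 chip pattern per payload bit, spread
    # over every time-frequency bin (the paper restricts embedding to
    # selected regions; full spreading keeps the sketch short).
    rng = np.random.default_rng(key)
    chips = rng.choice([-1.0, 1.0], size=(len(payload_bits),) + mel.shape)
    watermarked = mel.copy()
    for bit, chip in zip(payload_bits, chips):
        watermarked += alpha * (2 * bit - 1) * chip  # bit 0 -> -1, bit 1 -> +1
    return watermarked, chips

def extract(mel_received, mel_reference, chips):
    # Correlate the residual with each keyed chip; the sign of the
    # correlation recovers the corresponding payload bit.
    residual = mel_received - mel_reference
    corr = (chips * residual).sum(axis=(1, 2))
    return (corr > 0).astype(int)
```

The key matters twice in this construction: without it the chip patterns cannot be regenerated, so the correlation test is unavailable to an attacker, and with it extraction degrades gracefully because distortions only add noise to correlation sums taken over many bins.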
If this is right
- Watermark extraction reaches near-100 percent bit accuracy after common distortions such as compression and additive noise.
- The same embedding step works across different Mel-conditioned vocoders without retraining or architectural modification.
- Multi-user keyed construction supports scalable attribution while limiting unauthorized extraction.
- Perceptual audio quality remains high because the perturbations are low-energy and confined to selected regions.
Where Pith is reading between the lines
- The method could be adapted to other Mel-based audio generators beyond speech, such as music or sound-effect models.
- If extraction stays reliable under more aggressive attacks, regulators could require similar marks for public disclosure of synthetic media.
- Combining the keyed payload with existing metadata standards might create a layered provenance system that survives re-encoding.
- Testing on longer utterances or streaming synthesis would reveal whether the time-frequency selection strategy scales without quality loss.
Load-bearing premise
Low-energy keyed perturbations placed in the Mel-spectrogram will survive vocoder synthesis and everyday signal distortions without noticeably reducing audio quality or forcing changes to the TTS model.
What would settle it
Run the extraction detector on audio produced by a Mel-conditioned TTS model after standard MP3 compression at 128 kbps or additive white noise at 20 dB SNR and measure whether bit accuracy falls substantially below the reported near-100 percent level.
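The additive-noise half of that check can be sketched in isolation (a deliberately simplified stand-in: noise is added to the watermark carrier itself rather than to the post-vocoder waveform, and the payload size, `alpha`, and bin count are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
n_bins, n_bits, alpha, snr_db = 80 * 200, 16, 0.05, 20.0

payload = rng.integers(0, 2, n_bits)
chips = rng.choice([-1.0, 1.0], size=(n_bits, n_bins))

# Keyed spread-spectrum carrier holding the 16-bit payload.
carrier = (alpha * (2 * payload - 1)[:, None] * chips).sum(axis=0)

# Additive white Gaussian noise at 20 dB SNR relative to the carrier.
p_signal = np.mean(carrier ** 2)
noise = rng.normal(scale=np.sqrt(p_signal / 10 ** (snr_db / 10)), size=n_bins)

# Correlation detector: the sign of each chip's correlation gives the bit.
decoded = ((chips @ (carrier + noise)) > 0).astype(int)
bit_accuracy = np.mean(decoded == payload)
```

A `bit_accuracy` that collapses as `snr_db` is lowered, or after a real MP3 round-trip at 128 kbps on synthesized audio, is exactly the failure signal the proposed check would look for.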
Original abstract
In this paper, we propose MelShield, a robust, in-generation, keyed audio watermarking framework that embeds identifiable signals into AI-generated audio for copyright protection and reliable attribution. Specifically, MelShield operates in the Mel-spectrogram domain during the generation process, targeting intermediate acoustic representations in Mel-conditioned pipelines for text-to-speech (TTS) generation. The core idea is to treat the intermediate Mel-spectrogram as the host signal and embed a short binary payload via low-energy, keyed spread-spectrum perturbations distributed across carefully selected time-frequency regions prior to waveform synthesis. By performing watermarking before vocoder inference, MelShield remains plug-and-play for Mel-conditioned TTS architectures and does not require modification or retraining of the underlying TTS generation vocoder, such as DiffWave and HiFi-GAN. Moreover, the multi-user keyed construction enables scalable user-specific attribution, while the keyed verification mechanism limits unauthorized decoding, thereby reducing the risk of large-scale extractor probing and adversarial analysis. Extensive experiments on DiffWave and HiFi-GAN demonstrate that MelShield achieves reliable watermark extraction, approaching 100% bit accuracy, even under signal distortions, e.g., compression and additive noise, while preserving high perceptual audio quality.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes MelShield, a keyed audio watermarking framework that embeds short binary payloads as low-energy spread-spectrum perturbations directly into the Mel-spectrogram of Mel-conditioned TTS pipelines (e.g., DiffWave, HiFi-GAN) prior to vocoder synthesis. The method is presented as plug-and-play, supporting multi-user attribution via keyed verification while claiming near-100% bit-extraction accuracy under common distortions such as compression and additive noise, without perceptible quality loss or TTS model retraining.
Significance. If the robustness and quality claims hold, the work would offer a practical, non-intrusive solution for provenance attribution of AI-generated speech, addressing a timely need for copyright protection and scalable user-specific tracing. The in-generation, keyed design and compatibility with existing vocoders are notable strengths that could facilitate adoption.
major comments (2)
- Abstract: the central claim of 'approaching 100% bit accuracy' under distortions is stated without any quantitative tables, error bars, exact distortion parameters (e.g., SNR levels, compression bitrates), baseline comparisons, or statistical tests, preventing verification of the reported performance.
- Method and Experiments: the load-bearing assumption that low-energy keyed perturbations survive the non-linear vocoder mapping (HiFi-GAN, DiffWave) and remain extractable at high accuracy is not supported by reported ablation studies on perturbation energy levels, post-vocoder Mel reconstruction error, or accuracy-quality trade-offs.
minor comments (2)
- Abstract: the phrase 'approaching 100% bit accuracy' is imprecise; specific percentages, conditions, and confidence intervals should be provided.
- The description of 'carefully selected time-frequency regions' lacks detail on selection criteria or how they are determined from the payload and key.
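For illustration only (the paper does not disclose its selection criteria; the energy threshold, the keyed thinning step, and the `frac` parameter are our assumptions), one plausible keyed time-frequency selection would combine perceptual masking with key secrecy:

```python
import numpy as np

def select_regions(mel, key, frac=0.25):
    # Hypothetical criterion: keep only high-energy bins, where masking
    # best hides a perturbation, then keep a keyed random subset so an
    # attacker without the key cannot enumerate the carrier bins.
    rng = np.random.default_rng(key)
    thresh = np.quantile(mel, 1 - frac)   # top-energy quantile
    energetic = mel >= thresh
    keyed = rng.random(mel.shape) < 0.5   # keyed thinning
    return energetic & keyed
```

Any actual criterion in the paper could differ; the point of the sketch is that the selection must be reproducible from the key alone at extraction time.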
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed comments on our manuscript. We have addressed each major comment point by point below, with revisions made to improve clarity, verifiability, and empirical support where appropriate.
Point-by-point responses
-
Referee: Abstract: the central claim of 'approaching 100% bit accuracy' under distortions is stated without any quantitative tables, error bars, exact distortion parameters (e.g., SNR levels, compression bitrates), baseline comparisons, or statistical tests, preventing verification of the reported performance.
Authors: We agree that the abstract presents the performance claim at a high level and would benefit from greater specificity to enable direct verification. In the revised manuscript, we have updated the abstract to include references to the quantitative results from our experiments (e.g., bit accuracy under specific compression bitrates and SNR levels for additive noise), along with pointers to the tables, figures, error bars, baseline comparisons, and statistical tests provided in the main text. This revision ensures the claim is grounded and verifiable without changing the underlying experimental findings. revision: yes
-
Referee: Method and Experiments: the load-bearing assumption that low-energy keyed perturbations survive the non-linear vocoder mapping (HiFi-GAN, DiffWave) and remain extractable at high accuracy is not supported by reported ablation studies on perturbation energy levels, post-vocoder Mel reconstruction error, or accuracy-quality trade-offs.
Authors: The manuscript reports extensive end-to-end experiments on DiffWave and HiFi-GAN demonstrating high post-vocoding extraction accuracy under distortions, which empirically indicates that the low-energy perturbations survive the non-linear vocoder mapping. We acknowledge, however, that dedicated ablation studies on perturbation energy levels, post-vocoder Mel reconstruction error, and accuracy-quality trade-offs would provide more direct support for this assumption. We have therefore added these ablation studies to the revised manuscript (new subsection in Experiments), including sweeps of energy levels, reconstruction error metrics, and trade-off curves, to strengthen the evidence. revision: yes
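The promised energy-level ablation can be sketched as a simple sweep (entirely our construction: the `alpha` values, the white-noise stand-in for the vocoder round-trip, and the host-to-mark SNR used as a quality proxy are all assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
mel = rng.standard_normal((80, 200))      # stand-in host Mel-spectrogram
payload = rng.integers(0, 2, 16)
chips = np.random.default_rng(42).choice([-1.0, 1.0], size=(16,) + mel.shape)

results = []
for alpha in (0.002, 0.01, 0.05):         # perturbation energy sweep
    perturbation = (alpha * (2 * payload - 1)[:, None, None] * chips).sum(axis=0)
    watermarked = mel + perturbation
    # Crude stand-in for vocoder round-trip plus channel distortion.
    received = watermarked + rng.normal(scale=0.5, size=mel.shape)
    decoded = ((chips * (received - mel)).sum(axis=(1, 2)) > 0).astype(int)
    accuracy = float(np.mean(decoded == payload))
    # Host-to-mark SNR: higher means a quieter, less audible watermark.
    host_to_mark_db = 10 * np.log10(np.mean(mel ** 2) / np.mean(perturbation ** 2))
    results.append((alpha, accuracy, host_to_mark_db))
```

Plotting `accuracy` against `host_to_mark_db` across the sweep traces the accuracy-quality trade-off the referee asks for: larger `alpha` buys extraction margin at the cost of a louder mark.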
Circularity Check
No circularity: MelShield's embedding construction and reported extraction accuracies are independent empirical results
Full rationale
The paper presents a plug-and-play watermarking method that adds keyed spread-spectrum perturbations to the Mel-spectrogram prior to vocoder synthesis, then validates extraction bit accuracy under distortions via experiments on DiffWave and HiFi-GAN. No equations, fitted parameters, or self-citations are shown that reduce the claimed near-100% accuracy or robustness to a definition or input by construction. The central claims rest on external experimental validation rather than self-referential derivations, satisfying the criteria for a self-contained non-circular result.
Axiom & Free-Parameter Ledger
free parameters (1)
- perturbation energy level
axioms (1)
- domain assumption: Spread-spectrum embedding can survive subsequent vocoding and common audio distortions when applied to Mel-spectrograms
Reference graph
Works this paper leans on
- [1] Chen, G., Wu, Y., Liu, S., Liu, T., Du, X., Wei, F.: WavMark: Watermarking for audio generation. arXiv preprint arXiv:2308.12770 (2023)
- [2] Ito, K., Johnson, L.: The LJ Speech Dataset. https://keithito.com/LJ-Speech-Dataset/ (2017)
- [3] Jia, Y., Zhang, Y., Weiss, R., Wang, Q., et al.: Transfer learning from speaker verification to multispeaker text-to-speech synthesis. In: Proc. of NeurIPS 31 (2018)
- [4] Kim, J., Kim, S., Kong, J., Yoon, S.: Glow-TTS: A generative flow for text-to-speech via monotonic alignment search. In: Proc. of NeurIPS (2020)
- [5] Klein, N., Chen, T., Tak, H., Casal, R., Khoury, E.: Source tracing of audio deepfake systems. arXiv preprint arXiv:2407.08016 (2024)
- [6] Kong, J., Kim, J., Bae, J.: HiFi-GAN: Generative adversarial networks for efficient and high fidelity speech synthesis. In: Proc. of NeurIPS 33, 17022–17033 (2020)
- [7] Kong, Z., Ping, W., Huang, J., Zhao, K., Catanzaro, B.: DiffWave: A versatile diffusion model for audio synthesis. arXiv preprint arXiv:2009.09761 (2020)
- [8] Lee, S.g., Ping, W., Ginsburg, B., Catanzaro, B., Yoon, S.: BigVGAN: A universal neural vocoder with large-scale training. In: Proc. of ICLR (2023)
- [9] Li, Q., Lin, X.: Proactive audio authentication using speaker identity watermarking. In: PST, pp. 1–10 (2024)
- [10] Liu, C., Zhang, J., Zhang, T., Yang, X., Zhang, W., Yu, N.: Detecting voice cloning attacks via timbre watermarking. In: Network and Distributed System Security Symposium (2024)
- [11] Liu, W., Li, Y., Lin, D., Tian, H., Li, H.: Groot: Generating robust watermark for diffusion-model-based audio synthesis. In: Proc. of ACM MM (2024)
- [12]
- [13] Reddy, C.K.A., Gopal, V., Cutler, R.: DNSMOS: A non-intrusive perceptual objective speech quality metric to evaluate noise suppressors. In: Proc. of ICASSP, pp. 6493–6497 (2021)
- [14] Ren, Y., Hu, C., Tan, X., Qin, T., Zhao, S., Zhao, Z., Liu, T.Y.: FastSpeech 2: Fast and high-quality end-to-end text to speech. arXiv preprint arXiv:2006.04558 (2020)
- [15] Rix, A.W., Beerends, J.G., Hollier, M.P., Hekstra, A.P.: Perceptual evaluation of speech quality (PESQ): A new method for speech quality assessment of telephone networks and codecs. In: Proc. of ICASSP, vol. 2, pp. 749–752 (2001)
- [16] Roman, R.S., Fernandez, P., Défossez, A., Furon, T., Tran, T., Elsahar, H.: Proactive detection of voice cloning with localized watermarking. arXiv preprint arXiv:2401.17264 (2024)
- [17] Shen, J., Pang, R., Weiss, R.J., Schuster, M., Jaitly, N., Yang, Z., Chen, Z., Zhang, Y., Wang, Y., Skerry-Ryan, R., et al.: Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions. In: Proc. of ICASSP, pp. 4779–4783 (2018)
- [18] Stevens, S.S., Volkmann, J., Newman, E.B.: A scale for the measurement of the psychological magnitude pitch. The Journal of the Acoustical Society of America 8(3), 185–190 (1937)
- [19] Taal, C.H., Hendriks, R.C., Heusdens, R., Jensen, J.: An algorithm for intelligibility prediction of time-frequency weighted noisy speech. IEEE Transactions on Audio, Speech, and Language Processing 19(7), 2125–2136 (2011)
- [20] Van Den Oord, A., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner, N., Senior, A., Kavukcuoglu, K.: WaveNet: A generative model for raw audio. arXiv preprint arXiv:1609.03499 (2016)
- [21] Wang, C., Chen, S., Wu, Y., Zhang, Z., Zhou, L., Liu, S., Chen, Z., Liu, Y., Wang, H., Li, J., He, L., Zhao, S., Wei, F.: Neural codec language models are zero-shot text to speech synthesizers. arXiv preprint arXiv:2301.02111 (2023)
- [22]
- [23]