Hidden in Plain Tokens: Simply Robust, Gradient-Free Watermark for Synthetic Audio

Georgios Milis; Heng Huang; Yihan Wu; Yubin Qin

arxiv: 2605.25967 · v1 · pith:RKZ33MVHnew · submitted 2026-05-25 · 💻 cs.LG · cs.SD

Hidden in Plain Tokens: Simply Robust, Gradient-Free Watermark for Synthetic Audio

Georgios Milis , Yubin Qin , Yihan Wu , Heng Huang This is my paper

Pith reviewed 2026-06-29 23:06 UTC · model grok-4.3

classification 💻 cs.LG cs.SD

keywords watermarkingsynthetic audiogradient-freecommunity detectiontoken redundancyautoregressive modelscontent provenancediscrete tokens

0 comments

The pith

Reducing the audio tokenizer vocabulary via community detection on redundant tokens creates a gradient-free watermark with orders-of-magnitude higher detectability and built-in robustness.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that standard inference-time watermarks fail on continuous audio because of discretization inconsistencies, while finetuning tokenizers removes the training-free advantage. By theoretically analyzing how token errors degrade detection and then applying community detection to identify and collapse redundant tokens into a smaller vocabulary, the method mitigates those errors at inference time. This produces a watermark that remains fully gradient-free yet shows dramatically improved detection rates and resistance to common audio alterations. A sympathetic reader would care because the technique supplies a simple, no-training way to mark generated audio for provenance tracking.

Core claim

Motivated by vocabulary redundancy in discretization, the authors reduce the effective vocabulary through community detection on token redundancy; this reduction mitigates the impact of token errors on watermark detection, enabling a gradient-free method that boosts detectability by several orders of magnitude while providing built-in robustness to audio modifications and establishing a new state-of-the-art for token-level watermarks that arises directly from the nature of discrete representation learning.

What carries the argument

The reduced vocabulary obtained via community detection on token redundancy, which mitigates the impact of token errors on watermark detection.

If this is right

Watermark detectability increases by several orders of magnitude compared with prior token-level methods.
The watermark exhibits built-in robustness to audio modifications without any additional mechanisms.
The entire procedure remains gradient-free and requires no finetuning of the modality tokenizer.
The resulting performance sets a new state-of-the-art for token-level watermarks across multimedia modalities.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same vocabulary-reduction step may transfer to other discretized continuous signals such as video frames or sensor data.
Further gains could appear if future tokenizers are designed with explicit redundancy minimization rather than post-hoc community detection.
Because the improvement stems from properties of discrete representations, advances in representation learning may automatically strengthen watermarking performance.

Load-bearing premise

Reducing the vocabulary via community detection on token redundancy effectively mitigates the impact of token errors on watermark detection.

What would settle it

Running the watermark detection test on the same set of generated audio clips before and after the vocabulary reduction, then applying standard modifications such as compression or additive noise, and observing whether the detection rate fails to rise by multiple orders of magnitude.

Figures

Figures reproduced from arXiv: 2605.25967 by Georgios Milis, Heng Huang, Yihan Wu, Yubin Qin.

**Figure 1.** Figure 1: Illustration of a token-level watermarking mechanism in the audio domain. During generation, the autoregressive model computes a probability distribution over the vocabulary at each time step. A logit bias is pseudorandomly applied to a specific subset of tokens, encouraging their selection, and the resulting token sequence is synthesized into a waveform by the decoder D. For detection, the waveform is re-… view at source ↗

**Figure 2.** Figure 2: Illustration of how our method captures and explicitly mitigates the retokenization errors. First, we use the encoder and decoder modules from the codec of interest to encode, decode, and re-encode a dataset of waveforms (top). We use the confusion counts between tokens as edge weights in a graph where the vertices correspond to tokens. Then, we perform community detection on that graph, effectively redu… view at source ↗

**Figure 4.** Figure 4: Even with h > 0, our watermark still achieves unprecedented detectability even in very low FPR settings. Experiments with the Moshi model with h = 1 (top) and h = 2 (bottom), both prompted by conversational audio. use different audio codecs for discretization. First, we test Moshi (Defossez et al. ´ , 2024) which uses the Mimi codec and is highly capable at conversational speech. Second, we test the music… view at source ↗

**Figure 5.** Figure 5: Experiments with the Moshi model with h = 1 (top) and h = 2 (bottom), both prompted by LibriSpeech samples. as a sequence-to-sequence audio task. 5.1. Detectability We evaluate detectability by showing the true positive rate (TPR) by thresholding the p-values at a desired false positive rate (FPR). This allows us to visualize the sensitivity of the watermark in very low FPR scenarios. We present some resul… view at source ↗

**Figure 6.** Figure 6: Clustering effectiveness of the Leiden community detection method for different hyperparameter sweeps. As expected, the cluster preservation is much stronger than the baseline token preservation r. We observe that the effectiveness increases in small resolutions (where larger clusters are encouraged, thus fewer chances of inter-cluster error), and in small noise thresholds (meaning that even rare token sub… view at source ↗

**Figure 7.** Figure 7: Experiments with the MusicGen model with h = 0, prompted by captions describing music. Our proposed method is still superior to the baselines despite the different architecture and task [PITH_FULL_IMAGE:figures/full_fig_p014_7.png] view at source ↗

read the original abstract

As policy catches up with the capabilities of generative AI, watermarking is central to content provenance efforts. Inference-time watermarks for autoregressive models are unfit for continuous modalities due to discretization inconsistencies. Existing methods overcome this by finetuning the modality tokenizers, nullifying the watermark's training-free advantage. In this work, motivated by the vocabulary redundancy of discretization, we propose an elegant solution for powerful and robust watermarking of synthetic audio. We theoretically analyze the impact of token errors on watermark detection, and effectively mitigate them using a reduced vocabulary obtained via community detection. Thorough experiments showcase that our gradient-free method can boost detectability by several orders of magnitude, while also achieving built-in robustness to audio modifications. Broadly, we discover a new state-of-the-art for token-level watermarks in multimedia, which simply arises from the nature of discrete representation learning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The community-detection reduction for token-error mitigation in audio watermarking is a clean idea but the paper gives almost no concrete evidence that it actually produces the claimed orders-of-magnitude detectability gain.

read the letter

The main takeaway is a training-free audio watermark that shrinks the tokenizer vocabulary via community detection on token redundancy, then uses the smaller set for detection. The authors argue this sidesteps the usual need to finetune the tokenizer and gives built-in robustness to waveform edits.

What the work does cleanly is name the discretization inconsistency problem for continuous modalities and point out that vocabulary redundancy is under-used. The theoretical section on how token errors affect detection statistics is a useful framing, and treating the problem as a graph where communities capture substitutable tokens is a reasonable move.

The soft spot is exactly where the stress-test note flags: the paper never shows that the communities recovered by the detection algorithm line up with the actual token-substitution patterns that audio modifications induce. If the communities are just grouping tokens that are similar in embedding space but not in the error distribution under compression or additive noise, the theoretical improvement does not transfer and the empirical boost cannot be credited to the mechanism. The abstract mentions “thorough experiments” and “several orders of magnitude,” yet supplies no baselines, no error bars, no ablation on the community-detection parameters, and no check on whether the error model is independent or correlated. Without those, the central claim stays unverified.

This is aimed at people building practical provenance tools for generative audio. A reader already working on token-level watermarks might borrow the community-detection step, but the current manuscript does not give enough to reproduce or trust the performance numbers. I would not bring it to a reading group in its present form and would not cite it. It does not yet merit sending to referees; the load-bearing assumption about error alignment needs to be demonstrated first.

Referee Report

2 major / 1 minor

Summary. The manuscript claims a gradient-free watermark for synthetic audio that exploits vocabulary redundancy via community detection to produce a reduced vocabulary. It theoretically analyzes token-error impact on detection and mitigates it through this reduction, yielding orders-of-magnitude detectability gains plus built-in robustness to audio edits, all without tokenizer finetuning.

Significance. If the central mechanism holds, the result would be significant: it preserves the training-free property of inference-time watermarks while extending them to continuous modalities, potentially establishing a new baseline for token-level multimedia watermarking by directly using properties of discrete representation learning.

major comments (2)

[Abstract / theoretical analysis] Abstract (theoretical analysis paragraph): the claim that community detection on token redundancy produces a reduced vocabulary whose error statistics enable several-orders-of-magnitude detectability gains is load-bearing. The argument requires that the communities align with the actual substitution patterns induced by waveform modifications; if the detected communities instead reflect only static token co-occurrence rather than edit-induced errors, the theoretical mitigation does not transfer and the empirical boost cannot be attributed to the proposed mechanism.
[Abstract / experiments] Abstract (experiments paragraph): the reported orders-of-magnitude gains are presented without reference to error bars, number of trials, or explicit comparison against the strongest existing token-level baselines under identical audio-edit conditions. This leaves open whether post-hoc vocabulary choices or dataset-specific redundancy drive the result rather than the community-detection step itself.

minor comments (1)

[Abstract] Abstract: the sentence 'Broadly, we discover a new state-of-the-art' should be replaced by a precise statement of the quantitative improvement relative to prior token-level methods, with the supporting table or figure cited.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below, providing clarifications and committing to revisions that strengthen the presentation of the theoretical mechanism and experimental results.

read point-by-point responses

Referee: [Abstract / theoretical analysis] Abstract (theoretical analysis paragraph): the claim that community detection on token redundancy produces a reduced vocabulary whose error statistics enable several-orders-of-magnitude detectability gains is load-bearing. The argument requires that the communities align with the actual substitution patterns induced by waveform modifications; if the detected communities instead reflect only static token co-occurrence rather than edit-induced errors, the theoretical mitigation does not transfer and the empirical boost cannot be attributed to the proposed mechanism.

Authors: We appreciate this important clarification request. Community detection is performed on pairwise token embedding similarities from the audio tokenizer, which encode acoustic redundancies; these similarities correlate with substitution patterns under waveform edits because edits (e.g., noise, compression) induce confusions primarily among acoustically similar tokens. The theoretical analysis then shows how the reduced vocabulary lowers effective token-error rates. We will revise the manuscript to include an explicit discussion of this alignment, supported by a new analysis comparing detected communities against empirical substitution matrices obtained from edited audio samples. revision: partial
Referee: [Abstract / experiments] Abstract (experiments paragraph): the reported orders-of-magnitude gains are presented without reference to error bars, number of trials, or explicit comparison against the strongest existing token-level baselines under identical audio-edit conditions. This leaves open whether post-hoc vocabulary choices or dataset-specific redundancy drive the result rather than the community-detection step itself.

Authors: We agree that additional statistical detail and controls are warranted. The revised manuscript will report all detection metrics with error bars computed over multiple independent trials (specifying the exact number of runs), and will add head-to-head comparisons against the strongest published token-level audio watermarking baselines under identical audio-edit conditions. These additions will help isolate the contribution of the community-detection procedure. revision: yes

Circularity Check

0 steps flagged

No circularity: derivation relies on external algorithms and independent theoretical analysis

full rationale

The paper motivates its method from vocabulary redundancy in discretization (an external property of tokenizers) and mitigates token errors via community detection (a standard graph algorithm) after a theoretical analysis of error impact. No equations or claims reduce a prediction or result to a fitted parameter or self-definition by construction. No load-bearing self-citations or uniqueness theorems from the authors are invoked; the central detectability claim is presented as an empirical and theoretical consequence of the proposed reduction step rather than a renaming or tautology. The approach is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the domain assumption that token vocabularies in audio discretization contain exploitable redundancy.

axioms (1)

domain assumption The token vocabulary of audio discretization contains sufficient redundancy that community detection can produce a reduced set mitigating token errors.
Invoked to justify the core mitigation strategy and theoretical analysis of error impact.

pith-pipeline@v0.9.1-grok · 5681 in / 1050 out tokens · 26616 ms · 2026-06-29T23:06:04.296858+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

28 extracted references · 15 canonical work pages · 7 internal anchors

[1]

MusicLM: Generating Music From Text

URL https://scottaaronson. blog/?p=6823. Agostinelli, A., Denk, T. I., Borsos, Z., Engel, J., Verzetti, M., Caillon, A., Huang, Q., Jansen, A., Roberts, A., Tagliasacchi, M., Sharifi, M., Zeghidour, N., and Frank, C. Musiclm: Generating music from text.arXiv preprint arXiv:2301.11325,

work page internal anchor Pith review Pith/arXiv arXiv
[2]

D., Guillaume, J.-L., Lambiotte, R., and Lefeb- vre, E

Blondel, V . D., Guillaume, J.-L., Lambiotte, R., and Lefeb- vre, E. Fast unfolding of communities in large networks. Journal of statistical mechanics: theory and experiment, 2008(10):P10008,

2008
[3]

Audiolm: a lan- guage modeling approach to audio generation.arXiv preprint arXiv:2209.03143,

Borsos, Z., Marinier, R., Vincent, D., Kharitonov, E., Pietquin, O., Sharifi, M., Teboul, O., Grangier, D., Tagliasacchi, M., and Zeghidour, N. Audiolm: a lan- guage modeling approach to audio generation.arXiv preprint arXiv:2209.03143,

work page arXiv
[4]

Wavmark: Watermarking for audio generation.arXiv preprint arXiv:2308.12770,

Chen, G., Wu, Y ., Liu, S., Liu, T., Du, X., and Wei, F. Wavmark: Watermarking for audio generation.arXiv preprint arXiv:2308.12770,

work page arXiv
[5]

K., Sch¨uldt, C., and Chatterjee, S

Cumlin, F., Liang, X., Ungureanu, V ., Reddy, C. K., Sch¨uldt, C., and Chatterjee, S. Dnsmos pro: A reduced-size dnn for probabilistic mos of speech. In25th Interspeech Con- ferece 2024, Kos Island, Greece, Sep 1 2024-Sep 5 2024, pp. 4818–4822. International Speech Communication As- sociation,

2024
[6]

Moshi: a speech-text foundation model for real-time dialogue

D´efossez, A., Mazar ´e, L., Orsini, M., Royer, A., P ´erez, P., J ´egou, H., Grave, E., and Zeghidour, N. Moshi: a speech-text foundation model for real-time dialogue. arXiv preprint arXiv:2410.00037,

work page internal anchor Pith review Pith/arXiv arXiv
[7]

Kimi-Audio Technical Report

Ding, D., Ju, Z., Leng, Y ., Liu, S., Liu, T., Shang, Z., Shen, K., Song, W., Tan, X., Tang, H., et al. Kimi-audio techni- cal report.arXiv preprint arXiv:2504.18425,

work page internal anchor Pith review Pith/arXiv arXiv
[8]

CosyVoice 3: Towards In-the-wild Speech Generation via Scaling-up and Post-training

Du, Z., Gao, C., Wang, Y ., Yu, F., Zhao, T., Wang, H., Lv, X., Wang, H., Shi, X., An, K., et al. Cosyvoice 3: Towards in-the-wild speech generation via scaling-up and post-training.arXiv preprint arXiv:2505.17589,

work page internal anchor Pith review Pith/arXiv arXiv
[9]

Unbiased watermark for large language models

Hu, Z., Chen, L., Wu, X., Wu, Y ., Zhang, H., and Huang, H. Unbiased watermark for large language models. In International Conference on Learning Representations, volume 2024, pp. 45408–45436,

2024
[10]

Liu, A., Pan, L., Hu, X., Li, S., Wen, L., King, I., and Yu, P. S. An unforgeable publicly verifiable watermark for large language models.arXiv preprint arXiv:2307.16230, 2023a. Liu, A., Pan, L., Hu, X., Meng, S., and Wen, L. A semantic invariant robust watermark for large language models. arXiv preprint arXiv:2310.06356, 2023b. Liu, C., Zhang, J., Zhang,...

work page arXiv
[11]

Mittag, G., Naderi, B., Chehadi, A., and M¨oller, S

doi: 10.14722/ndss.2024.24200. Mittag, G., Naderi, B., Chehadi, A., and M¨oller, S. Nisqa: A deep cnn-self-attention model for multidimensional speech quality prediction with crowdsourced datasets. In Proc. Interspeech 2021, pp. 2127–2131,

work page doi:10.14722/ndss.2024.24200 2024
[12]

Code drift: Towards idempotent neural audio codecs

O’Reilly, P., Seetharaman, P., Su, J., Jin, Z., and Pardo, B. Code drift: Towards idempotent neural audio codecs. In ICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1–5. IEEE,

2025
[13]

Robust Speech Recognition via Large-Scale Weak Supervision

Radford, A., Kim, J. W., Xu, T., Brockman, G., McLeavey, C., and Sutskever, I. Robust speech recognition via large-scale weak supervision. arxiv 2022.arXiv preprint arXiv:2212.04356, 10,

work page internal anchor Pith review Pith/arXiv arXiv 2022
[14]

Llama- mimi: Exploring the limits of flattened speech language modeling.arXiv preprint arXiv:2509.14882v2,

Sugiura, I., Kurita, S., Oda, Y ., and Higashinaka, R. Llama- mimi: Exploring the limits of flattened speech language modeling.arXiv preprint arXiv:2509.14882v2,

work page arXiv
[15]

Training-free watermarking for autoregressive image generation.arXiv preprint arXiv:2505.14673,

Tong, Y ., Pan, Z., Yang, S., and Zhou, K. Training-free watermarking for autoregressive image generation.arXiv preprint arXiv:2505.14673,

work page arXiv
[16]

Emu3: Next-Token Prediction is All You Need

Wang, X., Zhang, X., Luo, Z., Sun, Q., Cui, Y ., Wang, J., Zhang, F., Wang, Y ., Li, Z., Yu, Q., et al. Emu3: Next-token prediction is all you need.arXiv preprint arXiv:2409.18869,

work page internal anchor Pith review Pith/arXiv arXiv
[17]

Spark-TTS: An Efficient LLM-Based Text-to-Speech Model with Single-Stream Decoupled Speech Tokens

URL https://arxiv.org/ abs/2503.01710. Wu, S., Liu, J., Huang, Y ., Guan, H., and Zhang, S. Adver- sarial audio watermarking: Embedding watermark into deep feature. In2023 IEEE International Conference on Multimedia and Expo (ICME), pp. 61–66. IEEE,

work page internal anchor Pith review Pith/arXiv arXiv
[18]

Distortion-free watermarks are not truly distortion-free under watermark key collisions.arXiv preprint arXiv:2406.02603,

Wu, Y ., Chen, R., Hu, Z., Chen, Y ., Guo, J., Zhang, H., and Huang, H. Distortion-free watermarks are not truly distortion-free under watermark key collisions.arXiv preprint arXiv:2406.02603,

work page arXiv
[19]

A watermark for auto-regressive speech generation models

Wu, Y ., Chen, R., Milis, G., Guo, J., and Huang, H. A watermark for auto-regressive speech generation models. InProc. Interspeech, pp. 3474–3478, 2025a. Wu, Y ., Milis, G., Chen, R., and Huang, H. Robust distortion-free watermark for autoregressive audio gen- eration models.The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025...

2019
[20]

Provable robust watermarking for ai-generated text.arXiv preprint arXiv:2306.17439,

Zhao, X., Ananth, P., Li, L., and Wang, Y .-X. Provable robust watermarking for ai-generated text.arXiv preprint arXiv:2306.17439,

work page arXiv
[21]

Extensibility Our vocabulary distillation method could in principle be applied to other multimodal autoregressive models, such as Emu3 (Wang et al., 2024)

10 Hidden in Plain Tokens A. Extensibility Our vocabulary distillation method could in principle be applied to other multimodal autoregressive models, such as Emu3 (Wang et al., 2024). Since we consider the RVQ architecture of modern audio codecs, where each channel typically has a much smaller vocabulary size than image discretizers, we hypothesize that ...

2024
[22]

Our proposed method is still superior to the baselines despite the different architecture and task

13 Hidden in Plain Tokens Figure 7.Experiments with the MusicGen model with h= 0 , prompted by captions describing music. Our proposed method is still superior to the baselines despite the different architecture and task. Table 7.Audio quality scores for audios generated by MusicGen with music prompts. FAD↓ h-gram Method VGGish CLAP None 0.247 0.039 h= 0 ...

2023
[23]

We use themedium sized checkpoint for generation, and finetune in the non-augmented setting

and LibriTTS (Zen et al., 2019), respectively. We use themedium sized checkpoint for generation, and finetune in the non-augmented setting. As watermarking parameters, we follow Jovanovi´c et al. (2025) and use γ= 0.25 everywhere and δ= 2 for Moshi and Spark-TTS. We empirically attenuated the watermark strength to achieve a better detectability-quality tr...

2019
[24]

(2025), we apply the watermark on the first 4 audio streams of Moshi, leaving the semantic (text) stream untouched

Jovanovi´c et al. (2025), we apply the watermark on the first 4 audio streams of Moshi, leaving the semantic (text) stream untouched. Similarly, we apply the watermark on all 4 streams of MusicGen. For the robustness experiments, we use the following diverse attack suite: •Speed perturbation:Resamples the audio to increase playback speed by a factor of 1....

2025
[25]

325 164 9.412 Base 1 2048 9.044 c= 1 (0.5,

2048
[26]

201 718 10.059 Base 1 2048 9.876 c= 2 (1.2,

2048
[27]

119 1540 13.391 Base 1 2048 12.099 (0.5,

2048
[28]

56 1628 15.843 Base 1 2048 15.230 (0.5,

2048

[1] [1]

MusicLM: Generating Music From Text

URL https://scottaaronson. blog/?p=6823. Agostinelli, A., Denk, T. I., Borsos, Z., Engel, J., Verzetti, M., Caillon, A., Huang, Q., Jansen, A., Roberts, A., Tagliasacchi, M., Sharifi, M., Zeghidour, N., and Frank, C. Musiclm: Generating music from text.arXiv preprint arXiv:2301.11325,

work page internal anchor Pith review Pith/arXiv arXiv

[2] [2]

D., Guillaume, J.-L., Lambiotte, R., and Lefeb- vre, E

Blondel, V . D., Guillaume, J.-L., Lambiotte, R., and Lefeb- vre, E. Fast unfolding of communities in large networks. Journal of statistical mechanics: theory and experiment, 2008(10):P10008,

2008

[3] [3]

Audiolm: a lan- guage modeling approach to audio generation.arXiv preprint arXiv:2209.03143,

Borsos, Z., Marinier, R., Vincent, D., Kharitonov, E., Pietquin, O., Sharifi, M., Teboul, O., Grangier, D., Tagliasacchi, M., and Zeghidour, N. Audiolm: a lan- guage modeling approach to audio generation.arXiv preprint arXiv:2209.03143,

work page arXiv

[4] [4]

Wavmark: Watermarking for audio generation.arXiv preprint arXiv:2308.12770,

Chen, G., Wu, Y ., Liu, S., Liu, T., Du, X., and Wei, F. Wavmark: Watermarking for audio generation.arXiv preprint arXiv:2308.12770,

work page arXiv

[5] [5]

K., Sch¨uldt, C., and Chatterjee, S

Cumlin, F., Liang, X., Ungureanu, V ., Reddy, C. K., Sch¨uldt, C., and Chatterjee, S. Dnsmos pro: A reduced-size dnn for probabilistic mos of speech. In25th Interspeech Con- ferece 2024, Kos Island, Greece, Sep 1 2024-Sep 5 2024, pp. 4818–4822. International Speech Communication As- sociation,

2024

[6] [6]

Moshi: a speech-text foundation model for real-time dialogue

D´efossez, A., Mazar ´e, L., Orsini, M., Royer, A., P ´erez, P., J ´egou, H., Grave, E., and Zeghidour, N. Moshi: a speech-text foundation model for real-time dialogue. arXiv preprint arXiv:2410.00037,

work page internal anchor Pith review Pith/arXiv arXiv

[7] [7]

Kimi-Audio Technical Report

Ding, D., Ju, Z., Leng, Y ., Liu, S., Liu, T., Shang, Z., Shen, K., Song, W., Tan, X., Tang, H., et al. Kimi-audio techni- cal report.arXiv preprint arXiv:2504.18425,

work page internal anchor Pith review Pith/arXiv arXiv

[8] [8]

CosyVoice 3: Towards In-the-wild Speech Generation via Scaling-up and Post-training

Du, Z., Gao, C., Wang, Y ., Yu, F., Zhao, T., Wang, H., Lv, X., Wang, H., Shi, X., An, K., et al. Cosyvoice 3: Towards in-the-wild speech generation via scaling-up and post-training.arXiv preprint arXiv:2505.17589,

work page internal anchor Pith review Pith/arXiv arXiv

[9] [9]

Unbiased watermark for large language models

Hu, Z., Chen, L., Wu, X., Wu, Y ., Zhang, H., and Huang, H. Unbiased watermark for large language models. In International Conference on Learning Representations, volume 2024, pp. 45408–45436,

2024

[10] [10]

Liu, A., Pan, L., Hu, X., Li, S., Wen, L., King, I., and Yu, P. S. An unforgeable publicly verifiable watermark for large language models.arXiv preprint arXiv:2307.16230, 2023a. Liu, A., Pan, L., Hu, X., Meng, S., and Wen, L. A semantic invariant robust watermark for large language models. arXiv preprint arXiv:2310.06356, 2023b. Liu, C., Zhang, J., Zhang,...

work page arXiv

[11] [11]

Mittag, G., Naderi, B., Chehadi, A., and M¨oller, S

doi: 10.14722/ndss.2024.24200. Mittag, G., Naderi, B., Chehadi, A., and M¨oller, S. Nisqa: A deep cnn-self-attention model for multidimensional speech quality prediction with crowdsourced datasets. In Proc. Interspeech 2021, pp. 2127–2131,

work page doi:10.14722/ndss.2024.24200 2024

[12] [12]

Code drift: Towards idempotent neural audio codecs

O’Reilly, P., Seetharaman, P., Su, J., Jin, Z., and Pardo, B. Code drift: Towards idempotent neural audio codecs. In ICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1–5. IEEE,

2025

[13] [13]

Robust Speech Recognition via Large-Scale Weak Supervision

Radford, A., Kim, J. W., Xu, T., Brockman, G., McLeavey, C., and Sutskever, I. Robust speech recognition via large-scale weak supervision. arxiv 2022.arXiv preprint arXiv:2212.04356, 10,

work page internal anchor Pith review Pith/arXiv arXiv 2022

[14] [14]

Llama- mimi: Exploring the limits of flattened speech language modeling.arXiv preprint arXiv:2509.14882v2,

Sugiura, I., Kurita, S., Oda, Y ., and Higashinaka, R. Llama- mimi: Exploring the limits of flattened speech language modeling.arXiv preprint arXiv:2509.14882v2,

work page arXiv

[15] [15]

Training-free watermarking for autoregressive image generation.arXiv preprint arXiv:2505.14673,

Tong, Y ., Pan, Z., Yang, S., and Zhou, K. Training-free watermarking for autoregressive image generation.arXiv preprint arXiv:2505.14673,

work page arXiv

[16] [16]

Emu3: Next-Token Prediction is All You Need

Wang, X., Zhang, X., Luo, Z., Sun, Q., Cui, Y ., Wang, J., Zhang, F., Wang, Y ., Li, Z., Yu, Q., et al. Emu3: Next-token prediction is all you need.arXiv preprint arXiv:2409.18869,

work page internal anchor Pith review Pith/arXiv arXiv

[17] [17]

Spark-TTS: An Efficient LLM-Based Text-to-Speech Model with Single-Stream Decoupled Speech Tokens

URL https://arxiv.org/ abs/2503.01710. Wu, S., Liu, J., Huang, Y ., Guan, H., and Zhang, S. Adver- sarial audio watermarking: Embedding watermark into deep feature. In2023 IEEE International Conference on Multimedia and Expo (ICME), pp. 61–66. IEEE,

work page internal anchor Pith review Pith/arXiv arXiv

[18] [18]

Distortion-free watermarks are not truly distortion-free under watermark key collisions.arXiv preprint arXiv:2406.02603,

Wu, Y ., Chen, R., Hu, Z., Chen, Y ., Guo, J., Zhang, H., and Huang, H. Distortion-free watermarks are not truly distortion-free under watermark key collisions.arXiv preprint arXiv:2406.02603,

work page arXiv

[19] [19]

A watermark for auto-regressive speech generation models

Wu, Y ., Chen, R., Milis, G., Guo, J., and Huang, H. A watermark for auto-regressive speech generation models. InProc. Interspeech, pp. 3474–3478, 2025a. Wu, Y ., Milis, G., Chen, R., and Huang, H. Robust distortion-free watermark for autoregressive audio gen- eration models.The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025...

2019

[20] [20]

Provable robust watermarking for ai-generated text.arXiv preprint arXiv:2306.17439,

Zhao, X., Ananth, P., Li, L., and Wang, Y .-X. Provable robust watermarking for ai-generated text.arXiv preprint arXiv:2306.17439,

work page arXiv

[21] [21]

Extensibility Our vocabulary distillation method could in principle be applied to other multimodal autoregressive models, such as Emu3 (Wang et al., 2024)

10 Hidden in Plain Tokens A. Extensibility Our vocabulary distillation method could in principle be applied to other multimodal autoregressive models, such as Emu3 (Wang et al., 2024). Since we consider the RVQ architecture of modern audio codecs, where each channel typically has a much smaller vocabulary size than image discretizers, we hypothesize that ...

2024

[22] [22]

Our proposed method is still superior to the baselines despite the different architecture and task

13 Hidden in Plain Tokens Figure 7.Experiments with the MusicGen model with h= 0 , prompted by captions describing music. Our proposed method is still superior to the baselines despite the different architecture and task. Table 7.Audio quality scores for audios generated by MusicGen with music prompts. FAD↓ h-gram Method VGGish CLAP None 0.247 0.039 h= 0 ...

2023

[23] [23]

We use themedium sized checkpoint for generation, and finetune in the non-augmented setting

and LibriTTS (Zen et al., 2019), respectively. We use themedium sized checkpoint for generation, and finetune in the non-augmented setting. As watermarking parameters, we follow Jovanovi´c et al. (2025) and use γ= 0.25 everywhere and δ= 2 for Moshi and Spark-TTS. We empirically attenuated the watermark strength to achieve a better detectability-quality tr...

2019

[24] [24]

(2025), we apply the watermark on the first 4 audio streams of Moshi, leaving the semantic (text) stream untouched

Jovanovi´c et al. (2025), we apply the watermark on the first 4 audio streams of Moshi, leaving the semantic (text) stream untouched. Similarly, we apply the watermark on all 4 streams of MusicGen. For the robustness experiments, we use the following diverse attack suite: •Speed perturbation:Resamples the audio to increase playback speed by a factor of 1....

2025

[25] [25]

325 164 9.412 Base 1 2048 9.044 c= 1 (0.5,

2048

[26] [26]

201 718 10.059 Base 1 2048 9.876 c= 2 (1.2,

2048

[27] [27]

119 1540 13.391 Base 1 2048 12.099 (0.5,

2048

[28] [28]

56 1628 15.843 Base 1 2048 15.230 (0.5,

2048