Watermarking for Proprietary Dataset Protection

Bhavya Kailkhura; Brian R. Bartoldson; John Kirchenbauer; Tom Goldstein

arXiv: 2607.00325 · v1 · pith:RLGGJGM2new · submitted 2026-07-01 · cs.LG · cs.CL

Pith reviewed 2026-07-02 16:07 UTC · model grok-4.3

keywords: watermarking, membership inference, dataset protection, language models, generative models, training data detection

The pith

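Watermarking lets a dataset owner detect whether a proprietary dataset was used to train a language model, with accuracy comparable to loss-based membership inference once enough of the dataset is exposed during training.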

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that output watermarking can serve as a viable alternative to loss-based membership inference for determining if a specific dataset appeared in a model's training data. It rests on the observation that models retain detectable watermark signals even when only a portion of their training data carried watermarks. Direct head-to-head tests show the watermark method reaches similar detection accuracy to traditional approaches once a large enough fraction of the dataset is included. This matters because it offers dataset owners a potential way to audit unauthorized training use under a different set of model assumptions than loss-based tests require.

Core claim

A watermark-based dataset inference method achieves membership detection performance comparable to loss-based membership inference when subset exposure during training is high enough, relying on residual watermark radioactivity in the trained model.

What carries the argument

Residual watermark radioactivity: the persistence of detectable output watermark signals in models trained on partially watermarked datasets.
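
To make the idea of a "detectable output watermark signal" concrete, here is a minimal sketch of a keyed green-list detector in the style of Kirchenbauer et al.'s "A Watermark for Large Language Models". It is an illustration only: this page does not say which watermarking scheme or detector the paper uses, and the function names, the SHA-256 seeding, and the green fraction `gamma = 0.5` are all assumptions made for the example.

```python
import hashlib
import math


def is_green(prev_token: int, token: int, key: str, gamma: float = 0.5) -> bool:
    """Pseudo-randomly place `token` on the green list, seeded by the secret
    key and the previous token (hypothetical seeding scheme)."""
    digest = hashlib.sha256(f"{key}:{prev_token}:{token}".encode()).digest()
    # Map the first 8 bytes of the hash to [0, 1) and compare to gamma.
    u = int.from_bytes(digest[:8], "big") / 2**64
    return u < gamma


def watermark_z_score(tokens: list[int], key: str, gamma: float = 0.5) -> float:
    """One-proportion z-test on the number of green tokens.

    Under the null (text unrelated to the key) each scored token is green
    with probability gamma, so the count is approximately Binomial(n, gamma).
    """
    n = len(tokens) - 1  # the first token has no predecessor to seed from
    if n <= 0:
        raise ValueError("need at least two tokens")
    green = sum(is_green(p, t, key, gamma) for p, t in zip(tokens, tokens[1:]))
    return (green - gamma * n) / math.sqrt(n * gamma * (1 - gamma))


def p_value_from_z(z: float) -> float:
    """One-sided upper-tail p-value under the standard normal."""
    return 0.5 * math.erfc(z / math.sqrt(2))
```

In a dataset-inference setting, the owner would score text generated by the suspect model with their own key; a z-score far above what clean models produce is the kind of residual signal the premise refers to. A normal approximation is only one possible reference distribution, and the figure captions below mention empirical references instead.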

If this is right

Watermarking provides an alternate route to membership detection that does not require direct access to model loss values.
Detection reliability increases with higher fractions of the proprietary dataset appearing in training.
The method trades the assumptions of loss-based inference for assumptions about watermark persistence instead.
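
For contrast, the loss-based baseline that the watermark method is compared against can be sketched as follows. This is a generic loss-threshold membership score with a rank-based AUC, not the paper's exact baseline; the helper names and the toy log-probabilities are made up for illustration.

```python
def mean_token_loss(token_logprobs: list[float]) -> float:
    """Average negative log-likelihood the model assigns to a document."""
    return -sum(token_logprobs) / len(token_logprobs)


def auc(pos_scores: list[float], neg_scores: list[float]) -> float:
    """Probability that a random positive outscores a random negative
    (ties count as half a win). Equivalent to the area under the ROC curve."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# Toy per-token log-probabilities (illustrative values, not real data).
members = [[-0.25, -0.25], [-0.5, -0.5]]
non_members = [[-1.0, -1.0], [-0.375, -0.375]]

# Membership score = negative loss: lower loss suggests the text was seen.
member_scores = [-mean_token_loss(doc) for doc in members]
non_member_scores = [-mean_token_loss(doc) for doc in non_members]

demo_auc = auc(member_scores, non_member_scores)  # 0.75
```

Note that this score needs per-token log-probabilities from the suspect model, which is the access assumption the watermark route avoids. With only two positives and two negatives, as in this toy example, the AUC can only move in steps of 0.25 when there are no ties, so small trial counts give coarse estimates.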

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Dataset owners could proactively watermark their data to enable future audits even if only part of it is copied.
The same principle might apply to non-language generative models if they also retain partial watermark signals.
It would be useful to test how robust the radioactivity signal remains when training involves data augmentation or filtering steps.

Load-bearing premise

Language models exhibit residual watermark radioactivity under partially watermarked training datasets.

What would settle it

A controlled training run on a dataset with partial watermarking that produces model outputs with no detectable watermark signal above random chance would falsify the core premise.
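
One way to operationalize "above random chance" is an empirical p-value against scores from models known to be clean. The sketch below uses the standard add-one Monte Carlo estimator; whether this matches the "empirical-exact" reference named in the figure captions is an assumption, since this page does not define that term.

```python
import math


def empirical_p_value(observed: float, null_scores: list[float]) -> float:
    """Add-one empirical p-value: the fraction of null scores at least as
    large as the observed score, counting the observation itself so the
    result is never exactly zero."""
    exceed = sum(1 for s in null_scores if s >= observed)
    return (1 + exceed) / (1 + len(null_scores))


def neg_log10_p(p: float) -> float:
    """Report significance on the -log10 scale used in the figure captions."""
    return -math.log10(p)
```

With this estimator the smallest achievable p-value is 1 / (1 + number of null scores), so the maximum reportable -log10 p is bounded by the size of the clean reference pool.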

Figures

Figures reproduced from arXiv: 2607.00325 by Bhavya Kailkhura, Brian R. Bartoldson, John Kirchenbauer, Tom Goldstein.

**Figure 1.** To protect a proprietary dataset from unauthorized use in training, the dataset owner (attacker) paraphrases their documents with a secret watermark key. To perform dataset inference, the attacker tests the suspect model’s predictions for evidence of the watermark key. The watermark detection test is used to conclude whether their protected data was included in the training dataset. What makes training dat…

**Figure 2.** Finetuning event-split matched clean-twin false-probe null (left) and keyed signal (right) on the aligned unpacked detection surface, scored as −log10 p under an empirical-exact reference. Each technique’s corresponding algorithm is used to produce the row-level MIA or whole-model DIA score, potentially using the rest of the clean models as reference models. Throughout, we present results in terms of − …

**Figure 3.** Finetuning event-split watermark whole-model DIA AUC across the F × E grid, on the aligned (left) and packed (right) detection surfaces, both scored against an empirical-exact null. The aligned chart on the left is based on the same probe surface as the right panel of …

**Figure 4.** Event-grid null-validity panel. Histogram of pooled empirical exact p-values from the packed matched-clean-negative whole-model readings across the (F, E) event grid. The annotated statistic and p-value are from a one-sample Kolmogorov-Smirnov test of these pooled p-values against Uniform(0, 1); the dashed horizontal line is the expected per-bin count under a uniform histogram with the plotted binning. We…

**Figure 5.** Pretraining signal across the ten-schedule sweep on the aligned unpacked detection surface, scored as −log10 p under the exact empirical-null reference, for the CPT initialization. Left: the clean-twin false-probe-null models; right: the keyed model trained on the watermarked data. Schedule clearly modulates the keyed readout: at E1, none of the keyed signals are separably significant, but at l…

**Figure 6.** Pretraining signal across the ten-schedule sweep on the aligned unpacked detection surface, scored as −log10 p under the exact empirical-null reference, for the from-scratch initialization. Left: the clean-twin false-probe-null models; right: the keyed model trained on the watermarked data. The schedule strengthens the keyed readout much faster as a function of E in this setting than in CPT: at …

**Figure 7.** Finetuning event-split matched clean-twin false-probe null (left) and keyed signal (right) across the F × E grid on the aligned unpacked detection surface, scored as −log10 p under the empirical-Gaussian reference.

**Figure 8.** Finetuning event-split matched clean-twin false-probe null (left) and keyed signal (right) across the F × E grid on the packed detection surface, scored as −log10 p under the empirical-exact null. The packed surface is a more permissive oracle baseline than the aligned surface. Grid axes: effective epochs E ∈ {1, 2, 4, 8, 16} over intervention folds; fold granularity F ∈ {2, 4, 8, 16}.

**Figure 9.** Finetuning event-split matched clean-twin false-probe null (left) and keyed signal (right) across the F × E grid on the packed detection surface, scored as −log10 p under the empirical-Gaussian reference. Grid axes: effective epochs E ∈ {1, 2, 4, 8, 16} over intervention folds; fold granularity F ∈ {2, 4, 8, 16}.

**Figure 10.** Finetuning event-split watermark whole-model DIA AUC across the F × E grid, on the aligned (left) and packed (right) detection surfaces, both scored against an empirical-Gaussian null.

**Figure 11.** Finetuning event-split keyed-signal exposure response across the grid, tracing −log10 p as a function of effective per-key exposure with separate lines for each fold count F. This figure plots the same per-cell keyed-signal values as the right panel of …

**Figure 12.** Finetuning event-split cross-key sham-null heatmap across the F × E grid, on the aligned (left) and packed (right) detection surfaces, scored as −log10 p under the empirical-exact null. The watermark detector is queried with a key the target model never saw in training, providing a negative control beyond the matched clean-twin false-probe null.

**Figure 13.** Finetuning event-split cross-key sham-null heatmap across the F × E grid, on the aligned (left) and packed (right) detection surfaces, scored as −log10 p under the empirical-Gaussian null.

**Figure 14.** Event-grid null-validity panel. Left: histogram of pooled empirical exact p-values from the packed matched-clean-negative whole-model readings across the (F, E) event grid. The annotated statistic and p-value are from a one-sample Kolmogorov-Smirnov test of these pooled p-values against Uniform(0, 1); the dashed horizontal line is the expected per-bin count under a uniform histogram with the plotted binni…

**Figure 15.** Finetuning SKS matched clean-twin false-probe null (left) and keyed signal (right) across the F × E grid on the aligned unpacked detection surface, scored as −log10 p under an empirical-exact reference. The false-probe null stays quiet on the same surface, validating the matched clean-twin negative as the right baseline against which to read the keyed map; the keyed signal grows monotonically with E and …

**Figure 16.** Finetuning SKS matched clean-twin false-probe null (left) and keyed signal (right) across the F × E grid on the aligned unpacked detection surface, scored as −log10 p under the empirical-Gaussian reference. Grid axes: effective epochs E ∈ {1, 2, 4, 8, 16} over intervention folds; fold granularity F ∈ {2, 4, 8, 16}.

**Figure 17.** Finetuning SKS matched clean-twin false-probe null (left) and keyed signal (right) across the F × E grid on the packed detection surface, scored as −log10 p under the empirical-exact null. The packed surface is a more permissive oracle baseline than the aligned surface. Grid axes: effective epochs E ∈ {1, 2, 4, 8, 16} over intervention folds; fold granularity F ∈ {2, 4, 8, 16}.

**Figure 18.** Finetuning SKS matched clean-twin false-probe null (left) and keyed signal (right) across the F × E grid on the packed detection surface, scored as −log10 p under the empirical-Gaussian reference.

**Figure 19.** Finetuning SKS watermark whole-model DIA AUC across the F × E grid, on the aligned (left) and packed (right) detection surfaces, both scored against an empirical-exact null. On the aligned surface the AUC saturates at 1.0 from E = 8 onward at every F, while the lowest-E corner remains coarse with only eight trials per cell. On the packed surface, the more permissive oracle recovers several of those low-ex…

**Figure 20.** Finetuning SKS watermark whole-model DIA AUC across the F × E grid, on the aligned (left) and packed (right) detection surfaces, both scored against an empirical-Gaussian null.

**Figure 21.** Finetuning SKS keyed-signal exposure response across the grid, tracing −log10 p as a function of effective per-key exposure with separate lines for each fold count F. This figure plots the same per-cell keyed-signal values as the right panel of …

**Figure 22.** SKS null-validity panel. Left: histogram of pooled empirical exact p-values from the packed clean-model watermark-surface false-probe rows across the depth-4 SKS grid. The annotated statistic and p-value are from a one-sample Kolmogorov-Smirnov test of these pooled p-values against Uniform(0, 1); the dashed horizontal line is the expected per-bin count under a uniform histogram with the plotted binning. R…

**Figure 23.** Pretraining CPT matched clean-twin false-probe null (left) and keyed signal (right) across the ten-schedule sweep at F = 2, on the aligned unpacked detection surface, scored as −log10 p under the empirical-Gaussian reference. Schedule axis: S1E1, S2E1, S3E1, S4E1, UE4, PE4, UE8, PE8, UE16, PE16; legend: Key 0, Key 1.

**Figure 24.** Pretraining CPT matched clean-twin false-probe null (left) and keyed signal (right) across the ten-schedule sweep at F = 2, on the packed detection surface, scored as −log10 p under the empirical-exact null. The packed surface is a more permissive oracle baseline than the aligned surface.

**Figure 25.** Pretraining CPT matched clean-twin false-probe null (left) and keyed signal (right) across the ten-schedule sweep at F = 2, on the packed detection surface, scored as −log10 p under the empirical-Gaussian reference. Schedule axis: S1E1, S2E1, S3E1, S4E1, UE4, PE4, UE8, PE8, UE16, PE16.

**Figure 26.** Pretraining CPT watermark whole-model DIA AUC across the ten-schedule sweep at F = 2, on the aligned (left) and packed (right) detection surfaces, both scored against an empirical-exact null. Each schedule contributes 2+/2− whole-model trials, so AUC is coarse but tracks the keyed-signal ordering of …

**Figure 27.** Pretraining CPT watermark whole-model DIA AUC across the ten-schedule sweep at F = 2, on the aligned (left) and packed (right) detection surfaces, both scored against an empirical-Gaussian null. Each schedule contributes 2+/2− whole-model trials. Schedule axis: S1E1, S2E1, S3E1, S4E1, UE4, PE4, UE8, PE8, UE16, PE16.

**Figure 28.** Pretraining from-scratch matched clean-twin false-probe null (left) and keyed signal (right) across the ten-schedule sweep at F = 2, on the aligned unpacked detection surface, scored as −log10 p under the empirical-Gaussian reference. From Section E.4 (Pretraining Realized Exposure and Ê Per-Schedule Readbacks): Tables 17 and 18 report the realized normalized exposure Ê/F per schedule for both pretraining initializa…

**Figure 29.** Pretraining from-scratch matched clean-twin false-probe null (left) and keyed signal (right) across the ten-schedule sweep at F = 2, on the packed detection surface, scored as −log10 p under the empirical-exact null. The packed surface is a more permissive oracle baseline than the aligned surface. Schedule axis: S1E1, S2E1, S3E1, S4E1, UE4, PE4, UE8, PE8, UE16, PE16.

**Figure 30.** Pretraining from-scratch matched clean-twin false-probe null (left) and keyed signal (right) across the ten-schedule sweep at F = 2, on the packed detection surface, scored as −log10 p under the empirical-Gaussian reference.

**Figure 31.** Pretraining from-scratch watermark whole-model DIA AUC across the ten-schedule sweep at F = 2, on the aligned (left) and packed (right) detection surfaces, both scored against an empirical-exact null. Each schedule contributes 2+/2− whole-model trials. The from-scratch DIA recovers near-saturated AUC at the high-exposure schedules well before CPT does. Schedule axis: S1E1, S2E1, S3E1, S4E1, UE4, PE4, UE8, PE8, UE16, PE16.

**Figure 32.** Pretraining from-scratch watermark whole-model DIA AUC across the ten-schedule sweep at F = 2, on the aligned (left) and packed (right) detection surfaces, both scored against an empirical-Gaussian null. Each schedule contributes 2+/2− whole-model trials.
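
Several captions above (Figures 4, 14, and 22) describe a null-validity check: p-values computed on clean models are pooled and compared with Uniform(0, 1) using a one-sample Kolmogorov-Smirnov test. The sketch below computes only the KS statistic for that comparison. It is an illustration of the general test, not the paper's code, and the function name is hypothetical.

```python
def ks_statistic_uniform(p_values: list[float]) -> float:
    """One-sample Kolmogorov-Smirnov statistic against Uniform(0, 1).

    Returns the largest vertical gap between the empirical CDF of the
    p-values and the uniform CDF F(x) = x on [0, 1].
    """
    xs = sorted(p_values)
    n = len(xs)
    if n == 0:
        raise ValueError("need at least one p-value")
    d = 0.0
    for i, x in enumerate(xs):
        if not 0.0 <= x <= 1.0:
            raise ValueError("p-values must lie in [0, 1]")
        d_plus = (i + 1) / n - x   # ECDF just after x, minus F(x)
        d_minus = x - i / n        # F(x) minus ECDF just before x
        d = max(d, d_plus, d_minus)
    return d
```

A small statistic means the clean-model p-values look uniform, which is what a valid null should produce. To turn the statistic into a test p-value, a library routine such as `scipy.stats.kstest(p_values, "uniform")` is one option.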

Original abstract

A growing body of literature suggests that training data membership inference problems are fundamentally hard tasks in modern language modeling settings. We argue that output watermarking techniques are the right gadget to make training membership tests for generative models more tractable, based on prior results showing that language models exhibit residual watermark "radioactivity" under partially watermarked training datasets. We pit a watermark-based dataset inference approach head-to-head against traditional loss-based membership inference methods and show that watermarking can achieve comparable membership detection performance when subset exposure is high enough, under an alternate set of assumptions.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Watermarking can match loss-based dataset inference when exposure is high enough, but the result rests on radioactivity persisting in partial mixtures.

The main takeaway is that watermarking can match loss-based methods for detecting if a dataset was used in training a language model, provided enough of the data carries the watermark and the radioactivity effect holds.

The paper applies existing watermark techniques to the membership inference problem and runs a comparison showing similar performance under high subset exposure. This is new in the sense that it shifts the assumptions away from loss signals toward watermark signals, which could be useful if loss methods struggle in certain regimes.

It does well by grounding the method in prior radioactivity results and making the case for why this gadget fits the problem. The head-to-head is a solid way to position the contribution.

The soft spot is the reliance on residual watermark radioactivity under partial exposure. The stress-test note points this out correctly, and if the experiments don't demonstrate that the signal remains strong enough across the tested conditions, the comparable performance won't hold. The abstract alone doesn't give the numbers, so the full paper needs to show clear evidence there.

This is for researchers focused on protecting proprietary data and auditing AI models. A reader looking for new tools in data provenance would find it relevant.

It deserves serious referee time because the core idea is direct and the comparison is head-to-head, though revisions may be needed to strengthen the experimental support for the key assumption.

Recommendation: send it out for review.

Referee Report

2 major / 0 minor

Summary. The paper argues that output watermarking techniques can make training data membership inference more tractable for generative language models by leveraging residual watermark radioactivity from partially watermarked training datasets. It compares a watermark-based dataset inference method head-to-head against loss-based membership inference and claims comparable detection performance when subset exposure is high enough, under an alternate set of assumptions.

Significance. If the radioactivity effect holds under the relevant conditions, the approach could provide dataset owners with an alternative membership detection tool that operates under different assumptions than loss-based baselines, potentially aiding proprietary data protection. The work explicitly builds on prior radioactivity results rather than introducing new parameter fitting or self-referential definitions.

major comments (2)

[Abstract] The headline claim of 'comparable membership detection performance' when subset exposure is high enough is load-bearing on the assumption that language models exhibit detectable residual watermark radioactivity under partially watermarked training data. The manuscript provides no new experiments, ablation on watermarked fraction, or quantitative tests of radioactivity strength to establish when (or whether) this effect is sufficient to match loss-based methods.

[Abstract] No methods, experimental setup, model scales, watermarking schemes, exposure fractions, or performance metrics (e.g., detection AUC or precision-recall) are described, so it is impossible to evaluate whether the claimed comparability holds or under what precise conditions the 'high enough' threshold occurs.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful comments. Our manuscript synthesizes prior results on watermark radioactivity to argue for an alternate approach to dataset inference under different assumptions from loss-based methods. We address the points below and will revise the abstract for clarity.

Referee: [Abstract] The headline claim of 'comparable membership detection performance' when subset exposure is high enough is load-bearing on the assumption that language models exhibit detectable residual watermark radioactivity under partially watermarked training data. The manuscript provides no new experiments, ablation on watermarked fraction, or quantitative tests of radioactivity strength to establish when (or whether) this effect is sufficient to match loss-based methods.

Authors: The manuscript explicitly builds on established prior results demonstrating residual watermark radioactivity under partially watermarked training data, rather than conducting new experiments. The head-to-head comparison is conceptual, showing that under the assumptions and conditions from those prior works (high subset exposure), watermark-based inference can achieve comparable performance to loss-based methods. We will revise the abstract to explicitly note the reliance on prior radioactivity findings and the alternate assumptions. Revision: yes.

Referee: [Abstract] No methods, experimental setup, model scales, watermarking schemes, exposure fractions, or performance metrics (e.g., detection AUC or precision-recall) are described, so it is impossible to evaluate whether the claimed comparability holds or under what precise conditions the 'high enough' threshold occurs.

Authors: As a synthesis of existing literature, the abstract remains high-level and does not repeat the detailed experimental setups, model scales, or metrics from the cited prior works on watermarking and membership inference. The 'high enough' threshold refers to the subset exposure levels where prior radioactivity results show detectable effects. We will update the abstract to reference the specific prior results and conditions more explicitly. Revision: yes.

Circularity Check

0 steps flagged

No significant circularity; empirical comparison is self-contained

The paper's central claim is an empirical head-to-head comparison showing watermark-based dataset inference can match loss-based methods under high subset exposure, grounded in cited prior observations of watermark radioactivity. No equations, fitted parameters renamed as predictions, or self-definitional steps appear in the provided abstract or description. The radioactivity premise is treated as an external empirical fact from prior work rather than derived within this manuscript, and the performance result is presented as experimental rather than forced by construction or self-citation chains. This is a standard non-circular empirical paper.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption of residual watermark radioactivity in models trained on partially watermarked data; no free parameters or invented entities are mentioned in the abstract.

axioms (1)

Domain assumption: language models exhibit residual watermark radioactivity under partially watermarked training datasets.
Explicitly stated in the abstract as the foundation for the watermark-based approach.

Reference graph

Works this paper leans on

19 extracted references · 13 canonical work pages · 3 internal anchors

[1]

Carlini, N., Chien, S., Nasr, M., Song, S., Terzis, A., and Tramer, F. Membership inference attacks from first principles. In 2022 IEEE symposium on security and privacy (SP), pp. 1897–1914. IEEE.

[2]

URL https://arxiv.org/abs/2306.09194.
Cooper, A. F., Gokaslan, A., Ahmed, A., Cyphert, A. B., De Sa, C., Lemley, M. A., Ho, D. E., and Liang, P. Extracting memorized pieces of (copyrighted) books from open-weight language models. arXiv preprint arXiv:2505.12546.

[3]

Duan, M., Suri, A., Mireshghallah, N., Min, S., Shi, W., Zettlemoyer, L., Tsvetkov, Y., Choi, Y., Evans, D., and Hajishirzi, H. Do membership inference attacks work on large language models? arXiv preprint arXiv:2402.07841.

[4]

Hayes, J., Shumailov, I., Choquette-Choo, C. A., Jagielski, M., Kaissis, G., Nasr, M., Ghalebikesabi, S., Annamalai, M. S. M. S., Mireshghallah, N., Shilov, I., et al. Exploring the limits of strong membership inference attacks on large language models. arXiv preprint arXiv:2505.18773.

[6]

URL https://arxiv.org/abs/2202.00622.
Kirchenbauer, J., Geiping, J., Wen, Y., Katz, J., Miers, I., and Goldstein, T. A Watermark for Large Language Models. In Krause, A., Brunskill, E., Cho, K., Engelhardt, B., Sabato, S., and Scarlett, J. (eds.), Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine ... URL https://arxiv.org/abs/2301.10226.

[7]

Kirchenbauer, J., Mongkolsupawan, J., Wen, Y., Goldstein, T., and Ippolito, D. A fictional q&a dataset for studying memorization and knowledge acquisition. arXiv preprint arXiv:2506.05639.

[8]

URL https://arxiv.org/abs/1703.04730.
Liu, K. Z., Choquette-Choo, C. A., Jagielski, M., Kairouz, P., Koyejo, S., Liang, P., and Papernot, N. Language models may verbatim complete text they were not explicitly trained on. arXiv preprint arXiv:2503.17514.

[9]

Olmo 3. URL https://arxiv.org/abs/2512.13961.
Park, S. M., Georgiev, K., Ilyas, A., Leclerc, G., and Madry, A. Trak: Attributing model behavior at scale. arXiv preprint arXiv:2303.14186. URL https://arxiv.org/abs/2303.14186.

[10]

Piet, J., Sitawarin, C., Fang, V., Mu, N., and Wagner, D. MARKMyWORDS: Analyzing and evaluating language model watermarks. In 2025 IEEE Conference on Secure and Trustworthy Machine Learning (SaTML), pp. 68–… URL https://arxiv.org/abs/2312.00273.

[11]

Sander, T., Fernandez, P., Durmus, A., Douze, M., and Furon, T. Watermarking makes language models radioactive. Advances in Neural Information Processing Systems, 37:21079–21113. URL https://arxiv.org/abs/2402.14904.

[12]

Sander, T., Fernandez, P., Mahloujifar, S., Durmus, A., and Guo, C. Detecting benchmark contamination through watermarking. arXiv preprint arXiv:2502.17259. URL https://arxiv.org/abs/2502.17259.

[13]

Shi, W., Ajith, A., Xia, M., Huang, Y., Liu, D., Blevins, T., Chen, D., and Zettlemoyer, L. Detecting pretraining data from large language models. In International Conference on Learning Representations, volume 2024, pp. 51826–51843.

[14]

Xu, Y. E., Kirchenbauer, J., Savani, Y., Trockman, A., Robey, A., Goldstein, T., Fang, F., and Kolter, J. Z. Antidistillation fingerprinting. arXiv preprint arXiv:2602.03812.

[16]

Qwen3 Technical Report. URL https://arxiv.org/abs/2505.09388.
Yeom, S., Giacomelli, I., Fredrikson, M., and Jha, S. Privacy risk in machine learning: Analyzing the connection to overfitting. In 2018 IEEE 31st computer security foundations symposium (CSF), pp. 268–282. IEEE.
[17]

Figures 3 and 10 carry the matching watermark whole-model DIA AUC pairs, with aligned and packed surfaces shown side-by-side under each p-value type. ...

[18]

governs both regimes; what differs is sibling support and the per-cell positive/negative trial budget (Table 12). We use SKS as a watermark-only sanity check on the per-key scaling story without sibling-key interference, and run the loss-based and reference-model comparison only on the more realistic event-split regime where the row-level baselines are me...

[19]

Table 10 reports the underlying realized Ê values. The discrepancy between idealized and realized values here is the realized overshoot of the watermarked subset’s effective epoch count relative to the planned schedule. D.4. SKS Training Scale and Trial Geometry: Table 11 reports the per-cell watermarked-token totals (mean, min–max across paired models) a...

[20]

The structure is grouped first by initialization regime (CPT then from-scratch), and within each by aligned-then-packed surface, exact-then-Gaussian p-value type, with the watermark whole-model DIA bar pairs trailing each init’s keyed/null block. The split row-level MIA and whole-model DIA baseline tables for both initialization regimes are reported below...

[21]

Tables 19 and 20 report the underlying realized Ê values. The corresponding idealized E targets are the integers in the schedule names (column E of Table 14). E.5. Pretraining Training Scale and Trial Geometry: Tables 21 and 22 report the per-schedule watermarked-token totals (mean, min–max across paired models) and the corresponding fraction of each model’s 1...

[1] Carlini, N., Chien, S., Nasr, M., Song, S., Terzis, A., and Tramer, F. Membership inference attacks from first principles. In 2022 IEEE symposium on security and privacy (SP), pp. 1897–1914. IEEE.

[2] Cooper, A. F., Gokaslan, A., Ahmed, A., Cyphert, A. B., De Sa, C., Lemley, M. A., Ho, D. E., and Liang, P. Extracting memorized pieces of (copyrighted) books from open-weight language models. arXiv preprint arXiv:2505.12546.

[3] Duan, M., Suri, A., Mireshghallah, N., Min, S., Shi, W., Zettlemoyer, L., Tsvetkov, Y., Choi, Y., Evans, D., and Hajishirzi, H. Do membership inference attacks work on large language models? arXiv preprint arXiv:2402.07841.

[4] Hayes, J., Shumailov, I., Choquette-Choo, C. A., Jagielski, M., Kaissis, G., Nasr, M., Ghalebikesabi, S., Annamalai, M. S. M. S., Mireshghallah, N., Shilov, I., et al. Exploring the limits of strong membership inference attacks on large language models. arXiv preprint arXiv:2505.18773.

[5] Kirchenbauer, J., Geiping, J., Wen, Y., Katz, J., Miers, I., and Goldstein, T. A Watermark for Large Language Models. In Krause, A., Brunskill, E., Cho, K., Engelhardt, B., Sabato, S., and Scarlett, J. (eds.), Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine ... URL https://arxiv.org/abs/2301.10226.

[6] Kirchenbauer, J., Mongkolsupawan, J., Wen, Y., Goldstein, T., and Ippolito, D. A fictional q&a dataset for studying memorization and knowledge acquisition. arXiv preprint arXiv:2506.05639.

[7] Liu, K. Z., Choquette-Choo, C. A., Jagielski, M., Kairouz, P., Koyejo, S., Liang, P., and Papernot, N. Language models may verbatim complete text they were not explicitly trained on. arXiv preprint arXiv:2503.17514.

Olmo 3. URL https://arxiv.org/abs/2512.13961.

[8] Park, S. M., Georgiev, K., Ilyas, A., Leclerc, G., and Madry, A. Trak: Attributing model behavior at scale. arXiv preprint arXiv:2303.14186.

[9] Piet, J., Sitawarin, C., Fang, V., Mu, N., and Wagner, D. MARKMyWORDS: Analyzing and evaluating language model watermarks. In 2025 IEEE Conference on Secure and Trustworthy Machine Learning (SaTML), pp. 68–. URL https://arxiv.org/abs/2312.00273.

[10] Sander, T., Fernandez, P., Durmus, A., Douze, M., and Furon, T. Watermarking makes language models radioactive. Advances in Neural Information Processing Systems, 37:21079–21113. URL https://arxiv.org/abs/2402.14904.

[11] Sander, T., Fernandez, P., Mahloujifar, S., Durmus, A., and Guo, C. Detecting benchmark contamination through watermarking. arXiv preprint arXiv:2502.17259. URL https://arxiv.org/abs/2502.17259.

[12] Shi, W., Ajith, A., Xia, M., Huang, Y., Liu, D., Blevins, T., Chen, D., and Zettlemoyer, L. Detecting pretraining data from large language models. In International Conference on Learning Representations, volume 2024, pp. 51826–51843.

[13] Xu, Y. E., Kirchenbauer, J., Savani, Y., Trockman, A., Robey, A., Goldstein, T., Fang, F., and Kolter, J. Z. Antidistillation fingerprinting. arXiv preprint arXiv:2602.03812.

Qwen3 Technical Report. URL https://arxiv.org/abs/2505.09388.

[14] Yeom, S., Giacomelli, I., Fredrikson, M., and Jha, S. Privacy risk in machine learning: Analyzing the connection to overfitting. In 2018 IEEE 31st computer security foundations symposium (CSF), pp. 268–282. IEEE.

[15] Figures 3 and 10 carry the matching watermark whole-model DIA AUC pairs, with aligned and packed surfaces shown side-by-side under each p-value type.

[Figure: heatmap with axes Effective Epochs (E) over Intervention Folds (ticks 1, 2, 4, 8, 16) and Fold Granularity (F) (ticks 2, 4, 8, 16).]

[16] governs both regimes; what differs is sibling support and the per-cell positive/negative trial budget (Table 12). We use SKS as a watermark-only sanity check on the per-key scaling story without sibling-key interference, and run the loss-based and reference-model comparison only on the more realistic event-split regime where the row-level baselines are me...
