Repurposing and Evaluating the (In)Feasibility of Dataset Poisoning enabled Watermarking for Contrastive Learning
Pith reviewed 2026-05-10 14:44 UTC · model grok-4.3
The pith
Trigger samples from data-poisoning attacks can be repurposed as verifiable watermarks for protecting contrastive learning datasets.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Trigger samples from data-poisoning backdoor attacks exhibit distinguishable statistical divergence from clean samples in contrastive learning, which can be leveraged through a unified density metric for verification and a multi-level watermarking scheme that adapts to feature-level, soft-label, or hard-label outputs, allowing weak backdoor effects to serve as reliable signals for dataset IP protection despite the original attacks' limitations.
What carries the argument
The statistical divergence of trigger samples from clean data, quantified by a unified density metric and embedded through a multi-level watermarking scheme that matches different CL output formats.
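The page does not define the unified density metric. As a hedged illustration only, the sketch below scores divergence with an isotropic Gaussian kernel density estimate (KDE) fit on clean reference features: suspect samples that fall in low-density regions lower the mean log-likelihood relative to held-out clean samples. The choice of KDE, the fixed bandwidth, the function names, and the synthetic data are all assumptions, not the paper's definition.

```python
import numpy as np


def gaussian_kde_logpdf(reference: np.ndarray, queries: np.ndarray, bandwidth: float) -> np.ndarray:
    """Log-density of each query row under an isotropic Gaussian KDE fit on `reference` rows."""
    n, d = reference.shape
    # Squared Euclidean distance between every query and every reference point: shape (q, n).
    sq_dist = ((queries[:, None, :] - reference[None, :, :]) ** 2).sum(axis=-1)
    log_kernel = -sq_dist / (2.0 * bandwidth**2)
    # Log-sum-exp over reference points for numerical stability.
    peak = log_kernel.max(axis=1, keepdims=True)
    log_sum = peak[:, 0] + np.log(np.exp(log_kernel - peak).sum(axis=1))
    log_norm = np.log(n) + 0.5 * d * np.log(2.0 * np.pi * bandwidth**2)
    return log_sum - log_norm


def density_divergence(reference, clean_probe, suspect_probe, bandwidth=1.0) -> float:
    """Hypothetical score: drop in mean log-density of suspect samples relative to clean ones."""
    clean_ll = gaussian_kde_logpdf(reference, clean_probe, bandwidth).mean()
    suspect_ll = gaussian_kde_logpdf(reference, suspect_probe, bandwidth).mean()
    return float(clean_ll - suspect_ll)


rng = np.random.default_rng(0)
reference = rng.normal(size=(500, 8))                # stand-in for clean embeddings
clean_probe = rng.normal(size=(100, 8))              # held-out clean samples
trigger_probe = rng.normal(loc=2.0, size=(100, 8))   # synthetic shifted "trigger" samples
print(density_divergence(reference, clean_probe, trigger_probe))
```

A positive score means the suspect set sits in lower-density regions than clean data; a second clean set should score near zero.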
If this is right
- Backdoor attacks with low success rates can still function as IP protection signals when paired with statistical verification.
- Watermarks can be embedded without requiring knowledge of any downstream task.
- A single poisoning method can support verification at feature, soft-label, or hard-label levels depending on the model output.
- Dataset owners gain a practical way to assert ownership even when full backdoor success is not achieved.
- Trade-offs among fidelity, verifiability, and robustness must be balanced for deployment in real CL pipelines.
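One way to read the first and third bullets together: at the hard-label level, even a weak backdoor can be verified if a suspect model predicts the target label on trigger inputs more often than a model that never saw the watermarked data. The sketch below is a one-sided binomial test under that assumption; the null hit rate `p0` (here chance level for a hypothetical 10-class task), the significance level, and the function names are illustrative, not taken from the paper.

```python
import math


def binomial_tail_pvalue(hits: int, trials: int, p0: float) -> float:
    """P(X >= hits) for X ~ Binomial(trials, p0), summed in log space to avoid underflow."""
    log_terms = [
        math.lgamma(trials + 1) - math.lgamma(i + 1) - math.lgamma(trials - i + 1)
        + i * math.log(p0) + (trials - i) * math.log1p(-p0)
        for i in range(hits, trials + 1)
    ]
    return min(1.0, math.fsum(math.exp(t) for t in log_terms))


def verify_hard_label(hits: int, trials: int, p0: float, alpha: float = 0.01):
    """Claim the watermark if target-label hits are too frequent to be chance at level alpha."""
    p_value = binomial_tail_pvalue(hits, trials, p0)
    return p_value < alpha, p_value


# A "weak" backdoor: only 40 of 200 trigger inputs (20%) land on the target label,
# versus a 10% chance rate for a model trained without the watermarked data.
print(verify_hard_label(40, 200, 0.10))  # True, with a p-value below 1e-3
print(verify_hard_label(22, 200, 0.10))  # False, with a p-value above 0.1
```

The point of the design is that verification depends on the gap between the observed hit rate and the null rate, not on the attack reaching a high success rate.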
Where Pith is reading between the lines
- The same statistical markers might be adapted to detect unauthorized use in other self-supervised learning settings beyond contrastive learning.
- Standard backdoor defenses could unintentionally strip these watermarks, creating a need for watermark-specific robustness tests.
- Dataset providers may need protocols to scan for embedded statistical signatures before releasing data publicly.
- Combining this technique with non-poisoning watermark methods could strengthen overall dataset protection strategies.
Load-bearing premise
Trigger samples from poisoning attacks maintain reliable statistical divergence from clean samples that can be verified without substantially harming contrastive learning performance or being removed by standard preprocessing.
What would settle it
An experiment in which common data augmentations or normalization steps used in contrastive learning eliminate the statistical divergence, rendering the density metric unable to distinguish trigger samples from clean ones.
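The falsification experiment described above can be prototyped with a small harness that measures a trigger-versus-clean effect size before and after augmentation. In this hedged toy version, images are synthetic 32×32 arrays, the trigger is a hypothetical 4×4 bright corner patch, the statistic is the per-image maximum (a stand-in for the unspecified density metric), and the augmentation is a CIFAR-style zero-pad, random crop, and horizontal flip. None of these choices come from the paper.

```python
import numpy as np


def random_crop_flip(img: np.ndarray, rng: np.random.Generator, pad: int = 4) -> np.ndarray:
    """Zero-pad, take a random crop of the original size, then flip horizontally with p=0.5."""
    h, w = img.shape
    padded = np.pad(img, pad)  # default mode pads with zeros
    oy, ox = rng.integers(0, 2 * pad + 1, size=2)
    out = padded[oy:oy + h, ox:ox + w]
    return out[:, ::-1] if rng.random() < 0.5 else out


def cohens_d(clean: np.ndarray, trigger: np.ndarray) -> float:
    """Standardized mean difference between trigger and clean statistics."""
    pooled_sd = np.sqrt((clean.var(ddof=1) + trigger.var(ddof=1)) / 2.0)
    return float((trigger.mean() - clean.mean()) / pooled_sd)


def divergence_before_after(clean, trigger, stat, augment, rng):
    """Effect size of `stat` on raw images versus on augmented copies of the same images."""
    before = cohens_d(np.array([stat(x) for x in clean]), np.array([stat(x) for x in trigger]))
    after = cohens_d(
        np.array([stat(augment(x, rng)) for x in clean]),
        np.array([stat(augment(x, rng)) for x in trigger]),
    )
    return before, after


rng = np.random.default_rng(0)
clean = [rng.uniform(0.0, 0.8, size=(32, 32)) for _ in range(200)]
trigger = []
for _ in range(200):
    img = rng.uniform(0.0, 0.8, size=(32, 32))
    img[-4:, -4:] = 1.0  # hypothetical trigger: bright bottom-right patch
    trigger.append(img)

before, after = divergence_before_after(clean, trigger, np.max, random_crop_flip, rng)
print(f"effect size before: {before:.1f}, after augmentation: {after:.1f}")
```

In this toy setting the crop removes the patch whenever either crop offset is zero (17 of 81 offset pairs), so the effect size shrinks but survives; a real test would substitute the paper's density metric and its actual CL augmentation pipeline.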
Original abstract
Contrastive learning (CL) reduces annotation cost via auto-derived supervisory signals. Since building large-scale in-house CL datasets is infeasible, reliance on third-party or internet data is common. Recent studies show that CL models are vulnerable to data-poisoning backdoor attacks, but the generalization and robustness of these attacks are underexplored. We systematically evaluate existing data-poisoning backdoor attacks on CL, revealing limitations: poor dataset adaptability, low success rates, limited portability, and restrictive assumptions (e.g., downstream task knowledge). Interestingly, trigger samples exhibit distinguishable statistical divergence from clean samples, which inspires us to repurpose this divergence as a watermark for dataset IP protection. Direct repurposing is challenging due to low success rates; we overcome this through statistical verification using a unified density metric. We further propose a multi-level watermarking scheme that adapts to feature-level, soft-label, or hard-label outputs in CL. Experiments show that some backdoor attacks can be repurposed as effective watermarks, with trade-offs among fidelity, verifiability, and robustness. This work demonstrates that weak backdoor effects become reliable signals for dataset IP protection in challenging CL settings.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper evaluates limitations of existing data-poisoning backdoor attacks when applied to contrastive learning (CL), including poor adaptability, low success rates, and restrictive assumptions. It observes statistical divergence between trigger and clean samples, repurposes this divergence as a dataset watermark via a unified density metric for statistical verification, and introduces a multi-level scheme supporting feature-level, soft-label, and hard-label outputs. Experiments are reported to show that certain backdoor attacks can be turned into effective watermarks, albeit with trade-offs in fidelity, verifiability, and robustness.
Significance. If the central claims hold, the work would demonstrate a practical route to dataset IP protection in CL settings by converting weak poisoning signals into verifiable watermarks without requiring new attack machinery. The systematic evaluation of backdoor limitations on CL is a clear positive contribution; the multi-level adaptation to different CL output formats could broaden applicability if the density metric proves stable.
major comments (3)
- [Abstract and Experiments section] The claim that 'experiments show some backdoor attacks can be repurposed as effective watermarks' is not supported by any reported quantitative success rates, baseline comparisons against non-poisoning watermarking methods, or ablation results on the unified density metric; without these, the central repurposing claim cannot be assessed for practical utility.
- [Section describing the unified density metric] No calibration procedure, threshold selection method, or invariance analysis under standard CL augmentations (random crops, color jitter, Gaussian blur) is provided; a skeptic's concern that trigger divergence collapses under these operations, if borne out, would directly undermine the verifiability guarantee required for a robust watermark.
- [Multi-level watermarking scheme] For the feature-, soft-, and hard-label variants, the paper does not report how the density metric is adapted across output types or whether fidelity to the original CL objective is preserved; this is load-bearing for the claim that the approach works 'in challenging CL settings.'
minor comments (2)
- [Abstract] The abstract would be strengthened by including at least one key quantitative result (e.g., watermark detection accuracy or downstream accuracy drop) to convey the scale of the reported trade-offs.
- [Method section] Notation for the unified density metric should be defined explicitly with a formula or pseudocode early in the method section to avoid ambiguity when comparing trigger versus clean distributions.
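The first minor comment asks for a headline detection number. One common single-number summary is the ROC AUC between verification scores on clean and trigger samples. A minimal sketch, assuming higher scores indicate the watermark; the function name is illustrative and the paper may report a different metric.

```python
import numpy as np


def detection_auc(clean_scores, trigger_scores) -> float:
    """ROC AUC as P(trigger score > clean score), counting ties as one half."""
    c = np.asarray(clean_scores, dtype=float)[:, None]
    t = np.asarray(trigger_scores, dtype=float)[None, :]
    # Compare every (clean, trigger) pair via broadcasting: shape (len(clean), len(trigger)).
    return float(np.mean((t > c) + 0.5 * (t == c)))


print(detection_auc([0.1, 0.2, 0.3], [0.8, 0.9]))  # 1.0: perfectly separated
print(detection_auc([0.1, 0.5], [0.1, 0.5]))       # 0.5: indistinguishable
```

The pairwise form is quadratic in the number of samples, which is fine for verification sets of a few thousand scores.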
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment point by point below, providing clarifications where possible and committing to revisions that strengthen the presentation of our results without overstating the current content.
Point-by-point responses
- Referee: [Abstract and Experiments section] The claim that 'experiments show some backdoor attacks can be repurposed as effective watermarks' is not supported by any reported quantitative success rates, baseline comparisons against non-poisoning watermarking methods, or ablation results on the unified density metric; without these, the central repurposing claim cannot be assessed for practical utility.
Authors: We acknowledge that the experiments section reports verification performance using the density metric but does not include the specific quantitative success rates, baseline comparisons to non-poisoning watermarking methods, or ablations on the density metric that the referee requests. To enable proper assessment of the repurposing claim, we will revise the experiments section to add explicit numerical success rates for watermark verification, baseline comparisons against existing non-poisoning watermarking approaches, and ablation studies isolating components of the unified density metric. revision: yes
- Referee: [Section describing the unified density metric] No calibration procedure, threshold selection method, or invariance analysis under standard CL augmentations (random crops, color jitter, Gaussian blur) is provided; a skeptic's concern that trigger divergence collapses under these operations, if borne out, would directly undermine the verifiability guarantee required for a robust watermark.
Authors: The referee correctly notes the absence of these methodological details in the current description of the unified density metric. We will add a new subsection that specifies the calibration procedure, the threshold selection method (e.g., via empirical quantiles on clean samples), and empirical invariance analysis under standard CL augmentations including random crops, color jitter, and Gaussian blur. This will either demonstrate stability of the statistical divergence or clearly delineate the conditions under which the verifiability guarantee holds. revision: yes
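The threshold rule mentioned in this response (empirical quantiles on clean samples) could look like the sketch below: calibrate on scores from clean references so that the false-positive rate on the calibration set is at most `alpha`, and report an empirical p-value for a suspect score. Function names are hypothetical and the scores are synthetic.

```python
import numpy as np


def calibrate_threshold(null_scores, alpha: float = 0.05) -> float:
    """Empirical (1 - alpha) quantile of scores measured on clean, non-watermarked references."""
    return float(np.quantile(np.asarray(null_scores, dtype=float), 1.0 - alpha))


def empirical_pvalue(score: float, null_scores) -> float:
    """Fraction of null scores at least as large as `score`, with the +1 finite-sample correction."""
    null = np.asarray(null_scores, dtype=float)
    return float((1 + np.sum(null >= score)) / (1 + null.size))


rng = np.random.default_rng(0)
null_scores = rng.normal(0.0, 1.0, size=1000)  # synthetic clean-reference scores
threshold = calibrate_threshold(null_scores, alpha=0.05)
print(threshold, np.mean(null_scores > threshold))  # calibration-set false-positive rate
print(empirical_pvalue(5.0, null_scores))
```

The +1 correction keeps the p-value strictly positive, so a finite calibration set never reports p = 0.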
- Referee: [Multi-level watermarking scheme] For the feature-, soft-, and hard-label variants, the paper does not report how the density metric is adapted across output types or whether fidelity to the original CL objective is preserved; this is load-bearing for the claim that the approach works 'in challenging CL settings.'
Authors: We agree that the adaptation of the density metric across output types and the preservation of fidelity to the CL objective require explicit reporting. The metric is applied directly to the respective representations (embeddings for feature-level, probability vectors for soft-label, and discrete predictions for hard-label). We will expand the scheme description to detail this adaptation and include new experimental results quantifying fidelity (e.g., change in contrastive loss and downstream accuracy) for each variant to support the claim in challenging CL settings. revision: yes
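The description of applying the metric directly to embeddings, probability vectors, and discrete predictions can be illustrated by reducing each output type to a per-sample score and comparing trigger and clean groups. In this hedged sketch the reductions (cosine similarity to a reference embedding, probability on a target class, and a target-label indicator) and the simple mean-gap comparison are assumptions standing in for the paper's unified density metric.

```python
import numpy as np


def per_sample_scores(outputs, level: str, target):
    """Map one model output type to a scalar score per sample.

    level="feature": outputs are (n, d) embeddings, target is a (d,) reference embedding.
    level="soft":    outputs are (n, k) class probabilities, target is a class index.
    level="hard":    outputs are (n,) predicted labels, target is a class index.
    """
    outputs = np.asarray(outputs)
    if level == "feature":
        ref = np.asarray(target, dtype=float)
        norms = np.linalg.norm(outputs, axis=1) * np.linalg.norm(ref)
        return outputs @ ref / norms  # cosine similarity to the reference embedding
    if level == "soft":
        return outputs[:, target].astype(float)  # probability mass on the target class
    if level == "hard":
        return (outputs == target).astype(float)  # 1.0 where the target label is predicted
    raise ValueError(f"unknown level: {level!r}")


def score_gap(clean_outputs, trigger_outputs, level: str, target) -> float:
    """Mean score on trigger samples minus mean score on clean samples."""
    clean = per_sample_scores(clean_outputs, level, target)
    trigger = per_sample_scores(trigger_outputs, level, target)
    return float(trigger.mean() - clean.mean())


print(score_gap([[1.0, 0.0], [0.9, 0.1]], [[0.1, 0.9], [0.0, 1.0]], "feature", [0.0, 1.0]))
print(score_gap([[0.9, 0.1], [0.8, 0.2]], [[0.3, 0.7], [0.4, 0.6]], "soft", 1))
print(score_gap([0, 0, 1, 0], [1, 1, 0, 1], "hard", 1))  # 0.75 - 0.25 = 0.5
```

Reducing every level to a scalar score lets one downstream statistic serve all three output formats, at the cost of discarding information at the richer levels.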
Circularity Check
No significant circularity; empirical observation of divergence plus new verification metric adds independent content.
Full rationale
The paper starts from published backdoor attacks, empirically notes distinguishable statistical divergence in trigger samples, and introduces a unified density metric plus multi-level scheme to repurpose them as watermarks. This chain does not reduce to self-definition, fitted parameters renamed as predictions, or load-bearing self-citations. The central results (trade-offs in fidelity/verifiability/robustness) are presented as experimental outcomes rather than quantities forced by the inputs. No equations or derivations in the provided text exhibit the reduction patterns; the work is self-contained against external benchmarks.
Zhiyang Dai received the bachelor’s degree in Qian Xuesen College from Nanjing University of Science and Technology, Nanjing, China, in 2021, where he is currently pursuing the Ph.D. degree with the School of Cyber Science and Engineering from Nanjing University of Science and Technology, Nanjin...