ArmSSL: Adversarial Robust Black-Box Watermarking for Self-Supervised Learning Pre-trained Encoders
Pith reviewed 2026-05-08 11:34 UTC · model grok-4.3
The pith
ArmSSL watermarks self-supervised encoders so owners can verify theft via black-box queries while the marks resist detection and removal.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By enlarging the feature-space discrepancy between each clean input and its paired watermark counterpart, ArmSSL produces a reliable black-box verification signal. Latent representation entanglement mixes watermark features with non-source-class clean features to prevent dense clusters, while distribution alignment reduces statistical differences so watermark samples appear in-distribution. Reference-guided tuning aligns the watermarked encoder's outputs on clean data with the original encoder's outputs, keeping utility intact. Experiments across five SSL frameworks and nine datasets show the approach achieves reliable verification accuracy, near-zero utility loss, and resistance to multiple detection and removal attacks.
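The verification signal described above can be made concrete with a small numpy sketch. This is our illustration under assumed geometry, not the paper's code: the `pair_discrepancy` name and the synthetic feature vectors are hypothetical. The key contrast is that an unrelated encoder maps a watermark input (a small perturbation of the clean input) to a nearby feature, while the watermarked encoder is trained to push the pair toward orthogonality.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def pair_discrepancy(f_clean, f_wm):
    """Discrepancy for one (clean, watermark) pair: orthogonal features
    give cosine ~ 0, so the score approaches 1 for a watermarked encoder."""
    return 1.0 - abs(cosine(f_clean, f_wm))

rng = np.random.default_rng(0)
f_clean = rng.normal(size=128)

# Watermarked encoder: trained to make the watermark counterpart
# orthogonal; here we construct that outcome directly via projection.
r = rng.normal(size=128)
f_hat = f_clean / np.linalg.norm(f_clean)
f_wm = r - (r @ f_hat) * f_hat          # exactly orthogonal to f_clean

# Independent encoder: the watermark input is only a small perturbation
# of the clean input, so the features stay close and cosine stays high.
f_indep = f_clean + 0.1 * rng.normal(size=128)

print(pair_discrepancy(f_clean, f_wm))     # near 1: strong signal
print(pair_discrepancy(f_clean, f_indep))  # near 0: no signal
```

The large gap between the two scores is what makes a simple thresholded black-box query feasible.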
What carries the argument
Paired discrepancy enlargement for verification, combined with latent representation entanglement and distribution alignment to disguise watermark samples as ordinary data.
If this is right
- Owners obtain a statistical test on query pairs that confirms ownership even after the encoder is used in new tasks.
- Watermark samples no longer form isolated clusters, blocking simple outlier-based detection methods.
- Downstream task accuracy stays essentially the same as the unmarked encoder.
- The scheme works for multiple popular self-supervised frameworks without requiring white-box access.
- End-to-end comparisons show better trade-offs than prior watermarking techniques on the same benchmarks.
Where Pith is reading between the lines
- If the hiding technique generalizes, similar entanglement and alignment steps could protect other types of pre-trained models beyond SSL.
- Widespread use might lower the practical value of stealing encoders, because verification becomes feasible even after deployment.
- The method opens the question of whether stronger future attacks could still separate the entangled representations without large utility cost.
Load-bearing premise
Entangling watermark representations with clean ones and aligning their distributions will keep watermark samples from forming a detectable out-of-distribution cluster while still allowing reliable verification and unchanged downstream performance.
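The two halves of this premise can be pictured with toy loss surrogates. This is a minimal sketch under our own simplifications: nearest-neighbor distance stands in for entanglement, and a linear-kernel mean-embedding distance stands in for distribution alignment; the paper's actual losses (it cites sliced Wasserstein distances) are richer.

```python
import numpy as np

def entanglement_loss(wm, clean):
    """Mean distance from each watermark feature to its nearest clean
    feature: small when watermark points interleave with clean ones
    instead of forming their own dense cluster."""
    d = np.linalg.norm(wm[:, None, :] - clean[None, :, :], axis=-1)
    return float(d.min(axis=1).mean())

def alignment_loss(wm, clean):
    """Squared distance between mean embeddings (a linear-kernel MMD
    surrogate): small when the two feature distributions match."""
    return float(np.sum((wm.mean(axis=0) - clean.mean(axis=0)) ** 2))

rng = np.random.default_rng(0)
clean = rng.normal(size=(200, 16))
wm_clustered = rng.normal(size=(50, 16)) * 0.2 + 5.0   # isolated OOD cluster
wm_entangled = clean[:50] + 0.1 * rng.normal(size=(50, 16))

print(entanglement_loss(wm_clustered, clean), entanglement_loss(wm_entangled, clean))
print(alignment_loss(wm_clustered, clean), alignment_loss(wm_entangled, clean))
```

Both surrogate losses drop sharply for the entangled placement, which is the regime the premise says can coexist with the orthogonality signal.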
What would settle it
An attack that either extracts or erases the watermark without harming downstream accuracy, or that reliably flags the watermark samples as out-of-distribution despite the entanglement and alignment steps.
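The second settling condition, reliably flagging watermark samples as OOD, could be probed with a simple density score. The sketch below is a generic k-nearest-neighbor outlier probe of our own choosing (not an attack from the paper); a successful attack would need its scores to separate watermark from clean features even after entanglement.

```python
import numpy as np

def knn_outlier_scores(points, ref, k=5):
    """Mean distance to the k nearest reference points; an isolated
    watermark cluster yields clearly larger scores than clean data."""
    d = np.linalg.norm(points[:, None, :] - ref[None, :, :], axis=-1)
    d.sort(axis=1)
    return d[:, :k].mean(axis=1)

rng = np.random.default_rng(0)
clean_ref = rng.normal(size=(300, 8))
clean_test = rng.normal(size=(100, 8))
wm_isolated = rng.normal(size=(50, 8)) + 4.0            # easy to flag
wm_hidden = clean_ref[:50] + 0.1 * rng.normal(size=(50, 8))

print(knn_outlier_scores(clean_test, clean_ref).mean())
print(knn_outlier_scores(wm_isolated, clean_ref).mean())  # stands out
print(knn_outlier_scores(wm_hidden, clean_ref).mean())    # blends in
```

If entanglement and alignment work as claimed, watermark scores should look like the hidden case, leaving this class of detector without a usable margin.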
read the original abstract
Self-supervised learning (SSL) encoders are invaluable intellectual property (IP). However, no existing SSL watermarking for IP protection can concurrently satisfy the following two practical requirements: (1) provide ownership verification capability under black-box suspect model access once the stolen encoders are used in downstream tasks; (2) be robust under adversarial watermark detection or removal, because the watermark samples form a distinguishable out-of-distribution (OOD) cluster. We propose ArmSSL, an SSL watermarking framework that assures black-box verifiability and adversarial robustness while preserving utility. For verification, we introduce paired discrepancy enlargement, enforcing feature-space orthogonality between the clean and its watermark counterpart to produce a reliable verification signal in black-box against the suspect model. For adversarial robustness, ArmSSL integrates latent representation entanglement and distribution alignment to suppress the OOD clustering. The former entangles watermark representations with clean representations (i.e., from non-source-class) to avoid forming a dense cluster of watermark samples, while the latter minimizes the distributional discrepancy between watermark and clean representations, thereby disguising watermark samples as natural in-distribution data. For utility, a reference-guided watermark tuning strategy is designed to allow the watermark to be learned as a small side task without affecting the main task by aligning the watermarked encoder's outputs with those of the original clean encoder on normal data. Extensive experiments across five mainstream SSL frameworks and nine benchmark datasets, along with end-to-end comparisons with SOTAs, demonstrate that ArmSSL achieves superior ownership verification, negligible utility degradation, and strong robustness against various adversarial detection and removal.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes ArmSSL, a watermarking framework for self-supervised learning (SSL) pre-trained encoders. It enables black-box ownership verification via paired discrepancy enlargement that enforces feature-space orthogonality between clean and watermark sample pairs. Adversarial robustness is achieved by combining latent representation entanglement (to avoid dense watermark clusters) with distribution alignment (to disguise watermark samples as in-distribution). Utility preservation uses reference-guided watermark tuning that aligns watermarked encoder outputs with the original on clean data. Experiments across five SSL frameworks and nine datasets, with comparisons to SOTAs, claim superior verification reliability, negligible utility loss, and strong robustness to detection/removal attacks.
Significance. If the central claims hold, the work is significant for IP protection of SSL encoders, filling gaps in prior schemes that lack concurrent black-box verifiability and adversarial robustness. The multi-component design (orthogonality for signal, entanglement+alignment for OOD suppression, reference tuning for utility) and broad experimental coverage across frameworks/datasets are strengths, though reproducible code and parameter-free derivations are not mentioned.
major comments (2)
- [§3.2] §3.2 (paired discrepancy enlargement and distribution alignment): the claim that alignment can suppress OOD clustering while preserving a reliable orthogonality-based verification signal is load-bearing, yet no ablation quantifies the verification metric (e.g., AUC or cosine discrepancy magnitude) before versus after the alignment step; this leaves the compatibility of minimizing distributional discrepancy with maintaining feature-space orthogonality unverified.
- [§4] §4 (experimental evaluation): while end-to-end comparisons with SOTAs are reported, the manuscript does not include per-attack success rates, threshold choices for black-box verification, or statistical significance tests across the nine datasets, making it difficult to assess whether the claimed superiority is robust to hyperparameter variation.
minor comments (2)
- Notation for the combined loss (entanglement + alignment + reference terms) is introduced without an explicit equation numbering the full objective; adding this would improve clarity.
- Figure captions for the OOD visualization and verification ROC curves could explicitly state the number of runs and error bars used.
Simulated Author's Rebuttal
Dear Editor, We thank the referee for the constructive and detailed comments on our manuscript. We address each major comment point by point below, providing clarifications and committing to revisions that strengthen the empirical support for our claims. All suggested additions are feasible with our existing experimental setup.
read point-by-point responses
-
Referee: [§3.2] §3.2 (paired discrepancy enlargement and distribution alignment): the claim that alignment can suppress OOD clustering while preserving a reliable orthogonality-based verification signal is load-bearing, yet no ablation quantifies the verification metric (e.g., AUC or cosine discrepancy magnitude) before versus after the alignment step; this leaves the compatibility of minimizing distributional discrepancy with maintaining feature-space orthogonality unverified.
Authors: We agree that an explicit ablation would better demonstrate the compatibility of the components. The paired discrepancy enlargement loss directly optimizes feature-space orthogonality between clean-watermark pairs, while distribution alignment minimizes marginal distributional discrepancy to suppress OOD clustering without altering the pairwise constraint. In the revised manuscript, we will add an ablation study reporting verification AUC and average cosine discrepancy magnitudes on the nine datasets both before and after the alignment step. This will empirically confirm that the orthogonality signal remains reliable (AUC > 0.95) post-alignment. revision: yes
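The verification-AUC metric the authors promise to report can be computed directly from discrepancy scores via the Mann-Whitney statistic. A minimal sketch, with hypothetical pre/post-alignment score distributions of our own choosing:

```python
import numpy as np

def auc(pos, neg):
    """AUC via the Mann-Whitney statistic: the probability that a
    watermarked-pair discrepancy exceeds an independent-pair one."""
    pos, neg = np.asarray(pos), np.asarray(neg)
    wins = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (wins + 0.5 * ties) / (len(pos) * len(neg))

rng = np.random.default_rng(0)
post_alignment = rng.normal(0.85, 0.08, size=200)  # watermarked pairs
independent = rng.normal(0.10, 0.05, size=200)     # unrelated encoders
print(auc(post_alignment, independent))            # close to 1.0
```

The ablation the referee asks for amounts to checking that this number stays high (the rebuttal claims AUC > 0.95) when the positive scores come from the post-alignment encoder.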
-
Referee: [§4] §4 (experimental evaluation): while end-to-end comparisons with SOTAs are reported, the manuscript does not include per-attack success rates, threshold choices for black-box verification, or statistical significance tests across the nine datasets, making it difficult to assess whether the claimed superiority is robust to hyperparameter variation.
Authors: We acknowledge that greater granularity in the experimental section would aid assessment of robustness. In the revision, we will expand §4 to include: (i) per-attack success rates (detection and removal) for all baselines and ArmSSL across the five SSL frameworks; (ii) explicit description of black-box verification threshold selection (e.g., 95th percentile of clean-pair discrepancies on a held-out set); and (iii) statistical significance tests (paired t-tests with p-values) comparing verification AUC and utility metrics across the nine datasets. These additions will show that superiority holds under the reported hyperparameter ranges. revision: yes
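The threshold rule the authors describe, the 95th percentile of clean-pair discrepancies on a held-out set, can be sketched in a few lines (the score distributions and the 0.50 decision margin below are our hypothetical choices, not values from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

# Held-out clean-pair discrepancies from encoders known to be unrelated.
heldout = rng.normal(0.10, 0.05, size=1000)
tau = np.percentile(heldout, 95)          # caps clean false positives near 5%

# Query the suspect model on watermark pairs and count threshold hits.
suspect = rng.normal(0.85, 0.08, size=100)
hit_rate = float((suspect > tau).mean())
owns = hit_rate > 0.50                    # far above the 5% chance level
print(tau, hit_rate, owns)
```

Calibrating `tau` on held-out clean pairs is what keeps the false-positive rate controlled regardless of how the suspect model was trained.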
Circularity Check
No circularity in claimed derivation chain
full rationale
The paper's core contributions are presented as independent design choices: paired discrepancy enlargement for verification signal, latent entanglement plus distribution alignment for OOD suppression, and reference-guided tuning for utility preservation. These are motivated by distinct goals and do not reduce to self-definition, fitted parameters renamed as predictions, or load-bearing self-citations. No equations or derivation steps are exhibited that equate outputs to inputs by construction. The potential tension between alignment and discrepancy (noted in the referee report) is a compatibility question, not a circularity reduction.