DIPBox: A Multi-scale Testing Framework for Tracking Dataset Regeneration

Guoxing Chen; Hao Chen; Haojin Zhu; Shaofeng Li; Tian Dong; Yan Meng; Yuling Chen; Zhen Liu

arxiv: 2606.21240 · v1 · pith:OFI4YMLSnew · submitted 2026-06-19 · 💻 cs.CR · cs.CV

DIPBox: A Multi-scale Testing Framework for Tracking Dataset Regeneration

Tian Dong , Yan Meng , Shaofeng Li , Guoxing Chen , Yuling Chen , Zhen Liu , Haojin Zhu , Hao Chen This is my paper

Pith reviewed 2026-06-26 14:13 UTC · model grok-4.3

classification 💻 cs.CR cs.CV

keywords dataset regenerationsimilarity metricsmulti-scale testingdata provenanceutility-divergence trade-offadversarial attacksdataset trackingmodel utility

0 comments

The pith

DIPBox detects unauthorized dataset regeneration by checking similarity at sample, set, and distribution scales under varying defender access.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that regenerating a dataset while preserving its utility for downstream models necessarily leaves measurable traces across multiple feature scales. This matters because a measurement study found substantial partial replication already occurring in public tumor datasets, and existing defenses target only individual data points rather than whole-set copying. DIPBox implements four similarity metrics to identify such regeneration suspects. A learning-theoretic analysis shows an inherent utility-divergence trade-off that sets fundamental limits on how evasive any regeneration attempt can be. Experiments across 16 base datasets, 320 regenerated versions, and 590 derived models show the approach outperforms prior methods and holds up under adaptive attacks.

Core claim

Regeneration that preserves model utility inevitably preserves measurable signals across sample-, set-, and distribution-level features. These signals are captured by four similarity metrics that allow accurate identification of regeneration suspects. DIPBox is the first testing framework that applies multi-scale similarity testing across a spectrum of defender access settings from limited to full information, with a learning-theoretic analysis that formalizes the utility-divergence trade-off and its implied limits on evasive regeneration.

What carries the argument

DIPBox multi-scale similarity testing framework using four metrics on sample-, set-, and distribution-level features, applied across limited-to-full defender information settings.

If this is right

Regenerated datasets remain identifiable even when they preserve high model utility.
Detection works under partial defender information about the original dataset.
An inherent trade-off limits how far regeneration can diverge without sacrificing utility.
The approach outperforms prior single-scale or point-based tracking methods.
Robustness holds under three forms of adaptive attack.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Multi-scale checks could extend to auditing other data-misuse patterns such as partial extraction or synthetic augmentation.
License holders might apply the metrics to verify compliance when models are trained on claimed datasets.
Complete evasion without utility loss may prove impractical, raising the effective cost of unauthorized copying.

Load-bearing premise

Regeneration that keeps model utility must also preserve detectable similarity signals at sample, set, and distribution scales.

What would settle it

A regeneration procedure that trains models to the same accuracy as the original dataset yet produces low scores on all four similarity metrics at every scale.

Figures

Figures reproduced from arXiv: 2606.21240 by Guoxing Chen, Hao Chen, Haojin Zhu, Shaofeng Li, Tian Dong, Yan Meng, Yuling Chen, Zhen Liu.

**Figure 2.** Figure 2: Measurement on open-source brain tumor datasets. [PITH_FULL_IMAGE:figures/full_fig_p002_2.png] view at source ↗

**Figure 3.** Figure 3: Overview of the regeneration attack primitives on [PITH_FULL_IMAGE:figures/full_fig_p003_3.png] view at source ↗

**Figure 4.** Figure 4: The workflow of DIPBox. With the created audit set, the auditor measures the similarity scores based on their access to the target model/dataset. The final judgment is based on thresholding and applies to multiple downstream investigations. (b) 𝒫𝒫𝒱𝒱 ≇ 𝒫𝒫𝑖𝑖𝑖𝑖𝑖𝑖𝑖𝑖𝑖𝑖 𝓟𝓟𝓥𝓥 (a) 𝒫𝒫𝒱𝒱 ≅ 𝒫𝒫𝒜𝒜 𝓟𝓟𝓐𝓐 𝓟𝓟𝓥𝓥 𝓟𝓟𝒊𝒊𝒊𝒊𝒊𝒊𝒊𝒊𝒊𝒊 Projection Distribution Samples from 𝒫𝒫𝒜𝒜 Samples from 𝒫𝒫𝒱𝒱 Samples from 𝒫𝒫𝑖𝑖𝑖𝑖𝑖𝑖𝑖𝑖𝑖𝑖 Outlier space … view at source ↗

**Figure 5.** Figure 5: Intuition of our audit set generation. The distribu [PITH_FULL_IMAGE:figures/full_fig_p005_5.png] view at source ↗

**Figure 6.** Figure 6: Intuition of GD: Training samples tend to induce [PITH_FULL_IMAGE:figures/full_fig_p006_6.png] view at source ↗

**Figure 7.** Figure 7: Robustness evaluation for defending the Attack [PITH_FULL_IMAGE:figures/full_fig_p009_7.png] view at source ↗

**Figure 10.** Figure 10: Robustness evaluation of defending Attack [PITH_FULL_IMAGE:figures/full_fig_p010_10.png] view at source ↗

**Figure 11.** Figure 11: The trade-off between the UR and the model met [PITH_FULL_IMAGE:figures/full_fig_p010_11.png] view at source ↗

**Figure 12.** Figure 12: Evaluation of metric-specific adaptive attacks. [PITH_FULL_IMAGE:figures/full_fig_p013_12.png] view at source ↗

**Figure 15.** Figure 15: An example experiment of transfer accuracy be [PITH_FULL_IMAGE:figures/full_fig_p017_15.png] view at source ↗

**Figure 14.** Figure 14: Text examples of sample-level regeneration attack. [PITH_FULL_IMAGE:figures/full_fig_p017_14.png] view at source ↗

read the original abstract

Training datasets have tremendous proprietary value and are vulnerable to unauthorized copying. Existing defenses mainly focus on tracking individual data points, but pay little attention to the threat of dataset regeneration. Through a measurement study of public tumor datasets, we identify substantial real-world partial-dataset replication, raising concerns about potential license noncompliance. To counter the challenge of tracking previously unknown adversarial regeneration, our key insight is that regeneration that preserves model utility inevitably preserves measurable signals across multiple feature scales. We categorize these dataset features into sample-, set-, and distribution-level features and design four similarity metrics to accurately identify regeneration. Based on these metrics, we develop DIPBox, which to our knowledge is the first testing framework that tracks regeneration suspects via multi-scale similarity testing across a spectrum of defender access settings, from limited to full information. We further provide a learning-theoretic analysis that justifies these multi-scale metrics and formalizes an inherent utility--divergence trade-off, implying fundamental limits on evasive regeneration. Extensive experiments on 16 vision and text base datasets, 320 regenerated datasets, and 590 derived models validate that DIPBox outperforms previous solutions while characterizing its robustness and limits under three adaptive attacks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

DIPBox gives a practical multi-scale detector for dataset regeneration with decent experiments, but the learning-theoretic claim that utility-preserving regenerations must leave traces in the four chosen metrics rests on assumptions that need explicit checking.

read the letter

The main contribution is a framework that treats regeneration detection as a multi-scale problem rather than point-wise watermarking. They split features into sample, set, and distribution levels, define four similarity metrics, and test across defender access regimes from black-box to full. The measurement study on tumor datasets showing real partial copies is a useful reality check, and the scale of the evaluation—16 base datasets, 320 regenerations, 590 models—plus tests against three adaptive attacks is more than most papers in this area deliver.

The learning-theoretic section formalizes a utility-divergence trade-off and is presented as justifying why evasion is limited. That is the part worth reading carefully. The abstract and stress-test note both flag the same open question: the argument assumes that any regeneration keeping downstream utility must preserve signals at all three scales under the chosen metrics. If the allowed regeneration class is not tightly specified (for example, whether arbitrary model architectures or non-convex combinations are permitted), it is possible to imagine regenerations that stay useful while dropping below detection thresholds. The paper does not appear to close that gap with a fully general proof.

The empirical results look solid on the tested regenerations, and the framework is clearly distinct from prior single-point tracking work. Still, the central guarantee depends on how broadly the theory applies.

This is for readers working on data provenance and model supply-chain security rather than core ML theory. It is worth sending to peer review because the experimental design is substantial and the multi-scale idea is new enough to merit referee scrutiny, even if the theory section needs tightening on the regeneration class.

Referee Report

1 major / 2 minor

Summary. The manuscript proposes DIPBox, a multi-scale testing framework for detecting unauthorized dataset regeneration. It categorizes dataset features into sample-, set-, and distribution-level signals, introduces four similarity metrics, develops a testing framework spanning limited to full defender information settings, provides a learning-theoretic analysis of an inherent utility-divergence trade-off, and reports experiments on 16 base datasets, 320 regenerations, and 590 models showing outperformance over prior methods and robustness under three adaptive attacks.

Significance. If the central claims hold, the work would represent a meaningful advance in dataset provenance tracking by moving beyond single-point watermarking to multi-scale detection with formal grounding. The scale of the experimental evaluation (16 base datasets, 320 regenerations, 590 models) and the attempt to derive fundamental limits via learning theory are clear strengths that would support impact in the data-protection and ML-security communities.

major comments (1)

[Abstract and learning-theoretic analysis] Abstract and learning-theoretic analysis: the claim that 'regeneration that preserves model utility inevitably preserves measurable signals across multiple feature scales' and that the analysis 'implies fundamental limits on evasive regeneration' is load-bearing for the identification accuracy of DIPBox. The analysis does not appear to state explicit assumptions on the class of allowed regenerations (e.g., whether regenerations are restricted to convex combinations, fixed model architectures, or particular divergence families). Without such restrictions it remains possible to construct utility-preserving regenerations that drive the four chosen metrics below detection thresholds, falsifying the inevitability premise.

minor comments (2)

[Experimental evaluation] The abstract states experiments cover '16 vision and text base datasets' but does not enumerate them; an explicit list (with references) in §5 or the experimental setup would improve reproducibility.
[Metrics definition] Notation for the four similarity metrics is introduced in the abstract but their precise mathematical definitions (including any hyperparameters) should be stated with numbered equations in the main body rather than deferred to an appendix.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for their constructive feedback and for recognizing the scale of our experimental evaluation and the learning-theoretic contributions. We address the single major comment below, proposing a targeted revision to clarify assumptions in the analysis section.

read point-by-point responses

Referee: Abstract and learning-theoretic analysis: the claim that 'regeneration that preserves model utility inevitably preserves measurable signals across multiple feature scales' and that the analysis 'implies fundamental limits on evasive regeneration' is load-bearing for the identification accuracy of DIPBox. The analysis does not appear to state explicit assumptions on the class of allowed regenerations (e.g., whether regenerations are restricted to convex combinations, fixed model architectures, or particular divergence families). Without such restrictions it remains possible to construct utility-preserving regenerations that drive the four chosen metrics below detection thresholds, falsifying the inevitability premise.

Authors: We appreciate the referee's identification of this point. The learning-theoretic analysis formalizes a utility-divergence trade-off under the standard setting of statistical learning theory, where regenerations are modeled as transformations that preserve downstream model performance (i.e., the regenerated dataset yields models with comparable empirical risk on the task). The analysis shows that maintaining low utility loss requires the regenerated distribution to remain close in specific divergence measures that align with our multi-scale metrics. However, we agree that the manuscript does not explicitly enumerate the precise class of regenerations (e.g., excluding arbitrary non-distribution-preserving maps or architecture-specific attacks). We will revise the analysis section (and corresponding abstract phrasing) to state these assumptions explicitly, including that the inevitability result holds for regenerations that do not fundamentally alter the data-generating process beyond bounded perturbations consistent with utility preservation. We will also add a discussion of the result's scope and potential counterexamples outside these assumptions. This revision will be made without changing the core claims or experiments. revision: yes

Circularity Check

0 steps flagged

No circularity identified; derivation self-contained via external analysis and experiments

full rationale

The abstract and description present the multi-scale metrics as motivated by an empirical measurement study and justified by a separate learning-theoretic analysis formalizing a utility-divergence trade-off. No quoted equations or steps reduce the claimed predictions or limits to fitted parameters, self-definitions, or self-citation chains. Experiments on 16 base datasets, 320 regenerations, and 590 models provide independent validation. The central inevitability premise is framed as a theorem-derived limit rather than a renaming or ansatz smuggling. This meets the default expectation of no significant circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no explicit free parameters, axioms, or invented entities; the core insight about preserved signals across scales is treated as an empirical observation rather than a derived axiom.

pith-pipeline@v0.9.1-grok · 5748 in / 1145 out tokens · 20409 ms · 2026-06-26T14:13:25.905841+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

85 extracted references · 1 canonical work pages

[1]

10Duke. 2023. Software Copyright – Basics Explained. https://www.10duke.com/resources/glossary/software-copyright

2023
[2]

Goodfellow, H

Martín Abadi, Andy Chu, Ian J. Goodfellow, H. Brendan McMahan, Ilya Mironov, Kunal Talwar, and Li Zhang. 2016. Deep Learning with Differential Privacy. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, 2016. ACM, 308–318

2016
[3]

Ehsan Amid, Rohan Anil, Wojciech Kotlowski, and Manfred K. Warmuth
[4]

Learning from Randomly Initialized Neural Network Features.CoRR abs/2202.06438 (2022)

arXiv 2022
[5]

arXiv. 2026. arXiv moderation. https://info.arxiv.org/help/moderation/index. html

2026
[6]

Buse Gul Atli Tekgul and N. Asokan. 2022. On the Effectiveness of Dataset Watermarking. InProceedings of the 2022 ACM on International Workshop on Se- curity and Privacy Analytics (IWSPA ’22). Association for Computing Machinery, 93–99

2022
[7]

Ms Aayushi Bansal, Dr Rewa Sharma, and Dr Mamta Kathuria. 2022. A systematic review on data scarcity problem in deep learning: solution and applications.ACM Computing Surveys (Csur)54, 10s (2022), 1–29

2022
[8]

Shai Ben-David, John Blitzer, Koby Crammer, Alex Kulesza, Fernando Pereira, and Jennifer Wortman Vaughan. 2010. A theory of learning from different domains.Machine learning79, 1 (2010), 151–175

2010
[9]

Prajjwal Bhargava, Aleksandr Drozd, and Anna Rogers. 2021. Generalization in NLI: Ways (Not) To Go Beyond Simple Heuristics. arXiv:2110.01518 [cs.CL]

arXiv 2021
[10]

Sutherland, Michael Arbel, and Arthur Gretton

Mikolaj Binkowski, Danica J. Sutherland, Michael Arbel, and Arthur Gretton
[11]

In6th International Conference on Learning Representations, ICLR 2018

Demystifying MMD GANs. In6th International Conference on Learning Representations, ICLR 2018

2018
[12]

Olivier Bousquet and André Elisseeff. 2002. Stability and generalization.Journal of machine learning research2, Mar (2002), 499–526

2002
[13]

Dingfan Chen, Raouf Kerkouche, and Mario Fritz. 2022. Private Set Generation with Discriminative Information. InAdvances in Neural Information Processing Systems (NeurIPS)

2022
[14]

Dingfan Chen, Ning Yu, Yang Zhang, and Mario Fritz. 2020. GAN-Leaks: A Taxonomy of Membership Inference Attacks against Generative Models. In Proceedings of the 2020 ACM SIGSAC Conference on Computer and Communications Security (CCS ’20). ACM, New York, NY, USA, 343–362

2020
[15]

Jialuo Chen, Jingyi Wang, Tinglan Peng, Youcheng Sun, Peng Cheng, Shouling Ji, Xingjun Ma, Bo Li, and Dawn Song. 2022. Copy, Right? A Testing Framework for Copyright Protection of Deep Learning Models. In43rd IEEE Symposium on Security and Privacy, SP 2022. IEEE, 824–841

2022
[16]

Jingjing Chen, Xiaobin Wang, Chen Huang, Xin Hu, Xinke Shen, and Dan Zhang
[17]

A large finer-grained affective computing EEG dataset.Scientific Data10, 1 (2023), 740

2023
[18]

Adam Coates, Andrew Ng, and Honglak Lee. 2011. An analysis of single-layer networks in unsupervised feature learning. InProceedings of the fourteenth inter- national conference on artificial intelligence and statistics. JMLR Workshop and Conference Proceedings, 215–223

2011
[19]

Former Contributor. 2015. Five Reasons To Copyright Register Your Software Now. https://www.forbes.com/sites/oliverherzfeld/2015/10/26/five-reasons-to- copyright-your-software-now

2015
[20]

Luke N Darlow, Elliot J Crowley, Antreas Antoniou, and Amos J Storkey. 2018. Cinic-10 is not imagenet or cifar-10.arXiv preprint arXiv:1810.03505(2018)

Pith/arXiv arXiv 2018
[21]

Tian Dong, Shaofeng Li, Guoxing Chen, Minhui Xue, Haojin Zhu, and Zhen Liu
[22]

In 30th Annual Network and Distributed System Security Symposium, NDSS 2023

RAI2: Responsible Identity Audit Governing the Artificial Intelligence. In 30th Annual Network and Distributed System Security Symposium, NDSS 2023. The Internet Society

2023
[23]

Linkang Du, Min Chen, Mingyang Sun, Shouling Ji, Peng Cheng, Jiming Chen, and Zhikun Zhang. 2024. ORL-AUDITOR: Dataset Auditing in Offline Deep Reinforcement Learning. In31st Annual Network and Distributed System Security Symposium, NDSS 2024. The Internet Society

2024
[24]

Linkang Du, Xuanru Zhou, Min Chen, Chusong Zhang, Zhou Su, Peng Cheng, Jiming Chen, and Zhikun Zhang. 2025. SoK: Dataset Copyright Auditing in Machine Learning Systems. InIEEE Symposium on Security and Privacy, SP 2025. IEEE

2025
[25]

The Economist. 2024. AI firms will soon exhaust most of the internet’s data. https://www.economist.com/schools-brief/2024/07/23/ai-firms-will-soon- exhaust-most-of-the-internets-data

2024
[26]

Bent Fuglede and Flemming Topsøe. 2004. Jensen-Shannon divergence and Hilbert space embedding. InProceedings of the 2004 IEEE International Symposium on Information Theory, ISIT 2004. IEEE, 31

2004
[27]

Bronstein

Raja Giryes, Guillermo Sapiro, and Alexander M. Bronstein. 2016. Deep Neural Networks with Random Gaussian Weights: A Universal Classification Strategy? 14 DIPBox: A Multi-scale Testing Framework for Tracking Dataset Regeneration IEEE Trans. Signal Process.64, 13 (2016), 3444–3457

2016
[28]

Borgwardt, Malte J

Arthur Gretton, Karsten M. Borgwardt, Malte J. Rasch, Bernhard Schölkopf, and Alexander J. Smola. 2012. A Kernel Two-Sample Test.J. Mach. Learn. Res.13 (2012), 723–773

2012
[29]

Besold, and Alexandre M

Kathrin Grosse, Lukas Bieringer, Tarek R. Besold, and Alexandre M. Alahi. 2024. Towards More Practical Threat Models in Artificial Intelligence Security. InProc. of USENIX Security

2024
[30]

Frederik Harder, Kamil Adamczewski, and Mijung Park. 2021. DP-MERF: Differ- entially Private Mean Embeddings with RandomFeatures for Practical Privacy- preserving Data Generation. InThe 24th International Conference on Artificial Intelligence and Statistics, AISTATS 2021 (Proceedings of Machine Learning Re- search, Vol. 130). PMLR, 1819–1827

2021
[31]

Moritz Hardt, Ben Recht, and Yoram Singer. 2016. Train faster, generalize better: Stability of stochastic gradient descent. InInternational conference on machine learning. PMLR, 1225–1234

2016
[32]

Sameed Hayat. 2026. Guardian News Dataset. https://www.kaggle.com/datasets/ sameedhayat/guardian-news-dataset. Accessed: 2026-06-18

2026
[33]

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. InProc. of IEEE CVPR

2016
[34]

Jonathan Ho, Ajay Jain, and Pieter Abbeel. 2020. Denoising Diffusion Proba- bilistic Models. InAdvances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020

2020
[35]

Yangsibo Huang, Chun-Yin Huang, Xiaoxiao Li, and Kai Li. 2022. A Dataset Auditing Method for Collaboratively Trained Machine Learning Models.IEEE Transactions on Medical Imaging(2022)

2022
[36]

Zonghao Huang, Neil Zhenqiang Gong, and Michael K. Reiter. 2024. A General Framework for Data-Use Auditing of ML Models. InProceedings of the 2024 ACM SIGSAC Conference on Computer and Communications Security, 2024. ACM

2024
[37]

Tero Karras, Miika Aittala, Timo Aila, and Samuli Laine. 2022. Elucidating the Design Space of Diffusion-Based Generative Models. InNeurIPS

2022
[38]

Tero Karras, Miika Aittala, Janne Hellsten, Samuli Laine, Jaakko Lehtinen, and Timo Aila. 2020. Training Generative Adversarial Networks with Limited Data. InProc. NeurIPS

2020
[39]

Tero Karras, Miika Aittala, Samuli Laine, Erik Härkönen, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. 2021. Alias-Free Generative Adversarial Networks. InAdvances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, NeurIPS 2021. 852–863

2021
[40]

Sander Koelstra, Christian Muhl, Mohammad Soleymani, Jong-Seok Lee, Ashkan Yazdani, Touradj Ebrahimi, Thierry Pun, Anton Nijholt, and Ioannis Patras
[41]

Deap: A database for emotion analysis; using physiological signals.IEEE transactions on affective computing3, 1 (2011), 18–31

2011
[42]

Berg, Peter N

Neeraj Kumar, Alexander C. Berg, Peter N. Belhumeur, and Shree K. Nayar. 2009. Attribute and simile classifiers for face verification. InIEEE 12th International Conference on Computer Vision, ICCV 2009. IEEE Computer Society, 365–372

2009
[43]

Huseyin Kusetogullari, Amir Yavariabdi, Abbas Cheddad, Håkan Grahn, and Johan Hall. 2020. ARDIS: a Swedish historical handwritten digit dataset.Neural Comput. Appl.32, 21 (2020), 16505–16518

2020
[44]

Huseyin Kusetogullari, Amir Yavariabdi, Johan Hall, and Niklas Lavesson. 2021. DIGITNET: A Deep Handwritten Digit Detection and Recognition Method Using a New Historical Handwritten Digit Dataset.Big Data Res.23 (2021), 100182

2021
[45]

Yiming Li, Yang Bai, Yong Jiang, Yong Yang, Shu-Tao Xia, and Bo Li. 2022. Untar- geted Backdoor Watermark: Towards Harmless and Stealthy Dataset Copyright Protection. InAdvances in Neural Information Processing Systems (NeurIPS)

2022
[46]

Yiming Li, Mingyan Zhu, Xue Yang, Yong Jiang, Tao Wei, and Shu-Tao Xia. 2023. Black-box Dataset Ownership Verification via Backdoor Watermarking.IEEE Transactions on Information Forensics and Security(2023), 1–1. doi:10.1109/TIFS. 2023.3265535

work page doi:10.1109/tifs 2023
[47]

Gaoyang Liu, Tianlong Xu, Xiaoqiang Ma, and Chen Wang. 2022. Your Model Trains on My Data? Protecting Intellectual Property of Training Data via Mem- bership Fingerprint Authentication.IEEE Trans. Inf. Forensics Secur.17 (2022), 1024–1037

2022
[48]

Hui Liu, Hongzhi Luo, Shaofeng Li, Tian Dong, Guoxing Chen, Yan Meng, and Haojin Zhu. 2023. Privacy Computing with Right to Be Forgotten in Trusted Execution Environment. InIEEE Global Communications Conference (GlobeCom), Kuala Lumpur, Malaysia

2023
[49]

Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. 2015. Deep Learning Face Attributes in the Wild. InProceedings of International Conference on Computer Vision (ICCV)

2015
[50]

Shayne Longpre, Robert Mahari, Anthony Chen, Naana Obeng-Marnu, Damien Sileo, William Brannon, Niklas Muennighoff, Nathan Khazam, Jad Kabbara, Kartik Perisetla, et al. 2024. A large-scale audit of dataset licensing and attribution in AI.Nature Machine Intelligence6, 8 (2024), 975–987

2024
[51]

Shayne Longpre, Robert Mahari, Ariel Lee, Campbell Lund, Hamidah Oderinwale, William Brannon, Nayan Saxena, Naana Obeng-Marnu, Tobin South, Cole Hunter, Kevin Klyman, Christopher Klamm, Hailey Schoelkopf, Nikhil Singh, Manuel Cherep, Ahmad Mustafa Anis, An Dinh, Caroline Chitongo, Da Yin, Damien Sileo, Deividas Mataciunas, Diganta Misra, Emad Alghamdi, En...

2024
[52]

Nils Lukas, Edward Jiang, Xinda Li, and Florian Kerschbaum. 2022. SoK: How Robust is Image Classification Deep Neural Network Watermarking?. In43rd IEEE Symposium on Security and Privacy, SP 2022. IEEE, 787–804

2022
[53]

Pratyush Maini, Hengrui Jia, Nicolas Papernot, and Adam Dziedzic. 2024. LLM Dataset Inference: Did you train on my dataset?. InThe Thirty-eighth Annual Conference on Neural Information Processing Systems

2024
[54]

Pratyush Maini, Mohammad Yaghini, and Nicolas Papernot. 2021. Dataset Infer- ence: Ownership Resolution in Machine Learning. InInternational Conference on Learning Representations

2021
[55]

Bertin Martens. 2018. The importance of data access regimes for artificial in- telligence and machine learning.JRC Digital Economy Working Paper 2018-09 (2018)

2018
[56]

Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and An- drew Y Ng. 2011. Reading digits in natural images with unsupervised feature learning.NIPS Workshop on Deep Learning and Unsupervised Feature Learning (2011)

2011
[57]

Google Play. 2024. Intellectual Property Policy. https://support.google.com/ googleplay/android-developer/answer/9888072

arXiv 2024
[58]

Girshick, Kaiming He, and Piotr Dollár

Ilija Radosavovic, Raj Prateek Kosaraju, Ross B. Girshick, Kaiming He, and Piotr Dollár. 2020. Designing Network Design Spaces. InProc. of IEEE/CVF CVPR

2020
[59]

Arpit Rajauria. 2020. pegasus paraphrase. https://huggingface.co/tuner007/ pegasus_paraphrase

2020
[60]

Alexandre Sablayrolles, Matthijs Douze, Cordelia Schmid, and Hervé Jégou. 2020. Radioactive data: tracing through training. InProceedings of the 37th International Conference on Machine Learning, ICML 2020 (Proceedings of Machine Learning Research, Vol. 119). PMLR, 8326–8335

2020
[61]

Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen

Mark Sandler, Andrew G. Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. 2018. MobileNetV2: Inverted Residuals and Linear Bottle- necks. In2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR

2018
[62]

Computer Vision Foundation / IEEE Computer Society, 4510–4520
[63]

Saxe, Pang Wei Koh, Zhenghao Chen, Maneesh Bhand, Bipin Suresh, and Andrew Y

Andrew M. Saxe, Pang Wei Koh, Zhenghao Chen, Maneesh Bhand, Bipin Suresh, and Andrew Y. Ng. 2011. On Random Weights and Unsupervised Feature Learn- ing. InProceedings of the 28th International Conference on Machine Learning, ICML 2011. Omnipress, 1089–1096

2011
[64]

Ilia Shumailov, Zakhar Shumaylov, Yiren Zhao, Nicolas Papernot, Ross Anderson, and Yarin Gal. 2024. AI models collapse when trained on recursively generated data.Nature631, 8022 (2024), 755–759

2024
[65]

Karen Simonyan and Andrew Zisserman. 2014. Very deep convolutional net- works for large-scale image recognition.arXiv preprint arXiv:1409.1556(2014)

Pith/arXiv arXiv 2014
[66]

Sriperumbudur, Arthur Gretton, Kenji Fukumizu, Bernhard Schölkopf, and Gert R

Bharath K. Sriperumbudur, Arthur Gretton, Kenji Fukumizu, Bernhard Schölkopf, and Gert R. G. Lanckriet. 2010. Hilbert Space Embeddings and Metrics on Probability Measures.J. Mach. Learn. Res.11 (2010), 1517–1561

2010
[67]

Mingxing Tan and Quoc V. Le. 2021. EfficientNetV2: Smaller Models and Faster Training. InProceedings of the 38th International Conference on Machine Learning, ICML 2021 (Proceedings of Machine Learning Research, Vol. 139). PMLR, 10096– 10106

2021
[68]

Iulia Turc, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. Well- read students learn better: The impact of student initialization on knowledge distillation.arXiv preprint arXiv:1908.08962(2019)

arXiv 2019
[69]

Hadas Unger, CHERGUELAINE Ayoub, and BOUBEKRI Faycal. 2023. CNN News Articles from 2011 to 2022. https://huggingface.co/datasets/AyoubChLin/CNN_ News_Articles_2011-2022

2023
[70]

James Vincent. 2023. Meta’s powerful AI language model has leaked online. https://www.theverge.com/2023/3/8/23629362/meta-ai-language- modelllama-leak-online-misuse

2023
[71]

Maxim Kuznetsov Vladimir Vorobev. 2023. A paraphrasing model based on ChatGPT paraphrases. https://huggingface.co/humarin/chatgpt_paraphraser_ on_T5_base

2023
[72]

Shuo Wang, Sharif Abuadbba, Sidharth Agarwal, Kristen Moore, Ruoxi Sun, Minhui Xue, Surya Nepal, Seyit Camtepe, and Salil Kanhere. 2022. PublicCheck: Public Integrity Verification for Services of Run-time Deep Models. In2023 IEEE Symposium on Security and Privacy (SP). IEEE Computer Society, 1239–1256

2022
[73]

Yifei Wang, Jizhe Zhang, and Yisen Wang. 2024. Do Generated Data Always Help Contrastive Learning?. InThe Twelfth International Conference on Learning Representations

2024
[74]

Yifan Yan, Xudong Pan, Mi Zhang, and Min Yang. 2023. Rethinking White-Box Watermarks on Deep Learning Models under Neural Structural Obfuscation. In 32nd USENIX Security Symposium, USENIX Security 2023. USENIX Association, 2347–2364. 15 Tian Dong∗, Yan Meng∗, Shaofeng Li †, Guoxing Chen ∗, Yuling Chen‡, Zhen Liu ∗, Haojin Zhu ∗, and Hao Chen §

2023
[75]

Ze Yang, Can Xu, Wei Wu, and Zhoujun Li. 2019. Read, Attend and Comment: A Deep Architecture for Automatic News Comment Generation. InProceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, EMNLP-IJCNLP

2019
[76]

Association for Computational Linguistics, 5076–5088
[77]

Yuxia Zhan, Yan Meng, Yichang Xiong Lu Zhou, Xiaokuan Zhang, Lichuan Ma, Guoxing Chen, Qingqi Pei, and Haojin Zhu. 2024. VPVet: Vetting Privacy Policies of Virtual Reality Apps. InProceedings of the 2024 ACM SIGSAC Conference on Computer and Communications Security, 2024. ACM

2024
[78]

Jingqing Zhang, Yao Zhao, Mohammad Saleh, and Peter J. Liu. 2020. PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization. In Proceedings of the 37th International Conference on Machine Learning, ICML 2020 (Proceedings of Machine Learning Research, Vol. 119). PMLR, 11328–11339

2020
[79]

Efros, Eli Shechtman, and Oliver Wang

Richard Zhang, Phillip Isola, Alexei A. Efros, Eli Shechtman, and Oliver Wang
[80]

In2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018

The Unreasonable Effectiveness of Deep Features as a Perceptual Metric. In2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018. Computer Vision Foundation / IEEE Computer Society, 586–595

2018

Showing first 80 references.

[1] [1]

10Duke. 2023. Software Copyright – Basics Explained. https://www.10duke.com/resources/glossary/software-copyright

2023

[2] [2]

Goodfellow, H

Martín Abadi, Andy Chu, Ian J. Goodfellow, H. Brendan McMahan, Ilya Mironov, Kunal Talwar, and Li Zhang. 2016. Deep Learning with Differential Privacy. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, 2016. ACM, 308–318

2016

[3] [3]

Ehsan Amid, Rohan Anil, Wojciech Kotlowski, and Manfred K. Warmuth

[4] [4]

Learning from Randomly Initialized Neural Network Features.CoRR abs/2202.06438 (2022)

arXiv 2022

[5] [5]

arXiv. 2026. arXiv moderation. https://info.arxiv.org/help/moderation/index. html

2026

[6] [6]

Buse Gul Atli Tekgul and N. Asokan. 2022. On the Effectiveness of Dataset Watermarking. InProceedings of the 2022 ACM on International Workshop on Se- curity and Privacy Analytics (IWSPA ’22). Association for Computing Machinery, 93–99

2022

[7] [7]

Ms Aayushi Bansal, Dr Rewa Sharma, and Dr Mamta Kathuria. 2022. A systematic review on data scarcity problem in deep learning: solution and applications.ACM Computing Surveys (Csur)54, 10s (2022), 1–29

2022

[8] [8]

Shai Ben-David, John Blitzer, Koby Crammer, Alex Kulesza, Fernando Pereira, and Jennifer Wortman Vaughan. 2010. A theory of learning from different domains.Machine learning79, 1 (2010), 151–175

2010

[9] [9]

Prajjwal Bhargava, Aleksandr Drozd, and Anna Rogers. 2021. Generalization in NLI: Ways (Not) To Go Beyond Simple Heuristics. arXiv:2110.01518 [cs.CL]

arXiv 2021

[10] [10]

Sutherland, Michael Arbel, and Arthur Gretton

Mikolaj Binkowski, Danica J. Sutherland, Michael Arbel, and Arthur Gretton

[11] [11]

In6th International Conference on Learning Representations, ICLR 2018

Demystifying MMD GANs. In6th International Conference on Learning Representations, ICLR 2018

2018

[12] [12]

Olivier Bousquet and André Elisseeff. 2002. Stability and generalization.Journal of machine learning research2, Mar (2002), 499–526

2002

[13] [13]

Dingfan Chen, Raouf Kerkouche, and Mario Fritz. 2022. Private Set Generation with Discriminative Information. InAdvances in Neural Information Processing Systems (NeurIPS)

2022

[14] [14]

Dingfan Chen, Ning Yu, Yang Zhang, and Mario Fritz. 2020. GAN-Leaks: A Taxonomy of Membership Inference Attacks against Generative Models. In Proceedings of the 2020 ACM SIGSAC Conference on Computer and Communications Security (CCS ’20). ACM, New York, NY, USA, 343–362

2020

[15] [15]

Jialuo Chen, Jingyi Wang, Tinglan Peng, Youcheng Sun, Peng Cheng, Shouling Ji, Xingjun Ma, Bo Li, and Dawn Song. 2022. Copy, Right? A Testing Framework for Copyright Protection of Deep Learning Models. In43rd IEEE Symposium on Security and Privacy, SP 2022. IEEE, 824–841

2022

[16] [16]

Jingjing Chen, Xiaobin Wang, Chen Huang, Xin Hu, Xinke Shen, and Dan Zhang

[17] [17]

A large finer-grained affective computing EEG dataset.Scientific Data10, 1 (2023), 740

2023

[18] [18]

Adam Coates, Andrew Ng, and Honglak Lee. 2011. An analysis of single-layer networks in unsupervised feature learning. InProceedings of the fourteenth inter- national conference on artificial intelligence and statistics. JMLR Workshop and Conference Proceedings, 215–223

2011

[19] [19]

Former Contributor. 2015. Five Reasons To Copyright Register Your Software Now. https://www.forbes.com/sites/oliverherzfeld/2015/10/26/five-reasons-to- copyright-your-software-now

2015

[20] [20]

Luke N Darlow, Elliot J Crowley, Antreas Antoniou, and Amos J Storkey. 2018. Cinic-10 is not imagenet or cifar-10.arXiv preprint arXiv:1810.03505(2018)

Pith/arXiv arXiv 2018

[21] [21]

Tian Dong, Shaofeng Li, Guoxing Chen, Minhui Xue, Haojin Zhu, and Zhen Liu

[22] [22]

In 30th Annual Network and Distributed System Security Symposium, NDSS 2023

RAI2: Responsible Identity Audit Governing the Artificial Intelligence. In 30th Annual Network and Distributed System Security Symposium, NDSS 2023. The Internet Society

2023

[23] [23]

Linkang Du, Min Chen, Mingyang Sun, Shouling Ji, Peng Cheng, Jiming Chen, and Zhikun Zhang. 2024. ORL-AUDITOR: Dataset Auditing in Offline Deep Reinforcement Learning. In31st Annual Network and Distributed System Security Symposium, NDSS 2024. The Internet Society

2024

[24] [24]

Linkang Du, Xuanru Zhou, Min Chen, Chusong Zhang, Zhou Su, Peng Cheng, Jiming Chen, and Zhikun Zhang. 2025. SoK: Dataset Copyright Auditing in Machine Learning Systems. InIEEE Symposium on Security and Privacy, SP 2025. IEEE

2025

[25] [25]

The Economist. 2024. AI firms will soon exhaust most of the internet’s data. https://www.economist.com/schools-brief/2024/07/23/ai-firms-will-soon- exhaust-most-of-the-internets-data

2024

[26] [26]

Bent Fuglede and Flemming Topsøe. 2004. Jensen-Shannon divergence and Hilbert space embedding. InProceedings of the 2004 IEEE International Symposium on Information Theory, ISIT 2004. IEEE, 31

2004

[27] [27]

Bronstein

Raja Giryes, Guillermo Sapiro, and Alexander M. Bronstein. 2016. Deep Neural Networks with Random Gaussian Weights: A Universal Classification Strategy? 14 DIPBox: A Multi-scale Testing Framework for Tracking Dataset Regeneration IEEE Trans. Signal Process.64, 13 (2016), 3444–3457

2016

[28] [28]

Borgwardt, Malte J

Arthur Gretton, Karsten M. Borgwardt, Malte J. Rasch, Bernhard Schölkopf, and Alexander J. Smola. 2012. A Kernel Two-Sample Test.J. Mach. Learn. Res.13 (2012), 723–773

2012

[29] [29]

Besold, and Alexandre M

Kathrin Grosse, Lukas Bieringer, Tarek R. Besold, and Alexandre M. Alahi. 2024. Towards More Practical Threat Models in Artificial Intelligence Security. InProc. of USENIX Security

2024

[30] [30]

Frederik Harder, Kamil Adamczewski, and Mijung Park. 2021. DP-MERF: Differ- entially Private Mean Embeddings with RandomFeatures for Practical Privacy- preserving Data Generation. InThe 24th International Conference on Artificial Intelligence and Statistics, AISTATS 2021 (Proceedings of Machine Learning Re- search, Vol. 130). PMLR, 1819–1827

2021

[31] [31]

Moritz Hardt, Ben Recht, and Yoram Singer. 2016. Train faster, generalize better: Stability of stochastic gradient descent. InInternational conference on machine learning. PMLR, 1225–1234

2016

[32] [32]

Sameed Hayat. 2026. Guardian News Dataset. https://www.kaggle.com/datasets/ sameedhayat/guardian-news-dataset. Accessed: 2026-06-18

2026

[33] [33]

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. InProc. of IEEE CVPR

2016

[34] [34]

Jonathan Ho, Ajay Jain, and Pieter Abbeel. 2020. Denoising Diffusion Proba- bilistic Models. InAdvances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020

2020

[35] [35]

Yangsibo Huang, Chun-Yin Huang, Xiaoxiao Li, and Kai Li. 2022. A Dataset Auditing Method for Collaboratively Trained Machine Learning Models.IEEE Transactions on Medical Imaging(2022)

2022

[36] [36]

Zonghao Huang, Neil Zhenqiang Gong, and Michael K. Reiter. 2024. A General Framework for Data-Use Auditing of ML Models. InProceedings of the 2024 ACM SIGSAC Conference on Computer and Communications Security, 2024. ACM

2024

[37] [37]

Tero Karras, Miika Aittala, Timo Aila, and Samuli Laine. 2022. Elucidating the Design Space of Diffusion-Based Generative Models. InNeurIPS

2022

[38] [38]

Tero Karras, Miika Aittala, Janne Hellsten, Samuli Laine, Jaakko Lehtinen, and Timo Aila. 2020. Training Generative Adversarial Networks with Limited Data. InProc. NeurIPS

2020

[39] [39]

Tero Karras, Miika Aittala, Samuli Laine, Erik Härkönen, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. 2021. Alias-Free Generative Adversarial Networks. InAdvances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, NeurIPS 2021. 852–863

2021

[40] [40]

Sander Koelstra, Christian Muhl, Mohammad Soleymani, Jong-Seok Lee, Ashkan Yazdani, Touradj Ebrahimi, Thierry Pun, Anton Nijholt, and Ioannis Patras

[41] [41]

Deap: A database for emotion analysis; using physiological signals.IEEE transactions on affective computing3, 1 (2011), 18–31

2011

[42] [42]

Berg, Peter N

Neeraj Kumar, Alexander C. Berg, Peter N. Belhumeur, and Shree K. Nayar. 2009. Attribute and simile classifiers for face verification. InIEEE 12th International Conference on Computer Vision, ICCV 2009. IEEE Computer Society, 365–372

2009

[43] [43]

Huseyin Kusetogullari, Amir Yavariabdi, Abbas Cheddad, Håkan Grahn, and Johan Hall. 2020. ARDIS: a Swedish historical handwritten digit dataset.Neural Comput. Appl.32, 21 (2020), 16505–16518

2020

[44] [44]

Huseyin Kusetogullari, Amir Yavariabdi, Johan Hall, and Niklas Lavesson. 2021. DIGITNET: A Deep Handwritten Digit Detection and Recognition Method Using a New Historical Handwritten Digit Dataset.Big Data Res.23 (2021), 100182

2021

[45] [45]

Yiming Li, Yang Bai, Yong Jiang, Yong Yang, Shu-Tao Xia, and Bo Li. 2022. Untar- geted Backdoor Watermark: Towards Harmless and Stealthy Dataset Copyright Protection. InAdvances in Neural Information Processing Systems (NeurIPS)

2022

[46] [46]

Yiming Li, Mingyan Zhu, Xue Yang, Yong Jiang, Tao Wei, and Shu-Tao Xia. 2023. Black-box Dataset Ownership Verification via Backdoor Watermarking.IEEE Transactions on Information Forensics and Security(2023), 1–1. doi:10.1109/TIFS. 2023.3265535

work page doi:10.1109/tifs 2023

[47] [47]

Gaoyang Liu, Tianlong Xu, Xiaoqiang Ma, and Chen Wang. 2022. Your Model Trains on My Data? Protecting Intellectual Property of Training Data via Mem- bership Fingerprint Authentication.IEEE Trans. Inf. Forensics Secur.17 (2022), 1024–1037

2022

[48] [48]

Hui Liu, Hongzhi Luo, Shaofeng Li, Tian Dong, Guoxing Chen, Yan Meng, and Haojin Zhu. 2023. Privacy Computing with Right to Be Forgotten in Trusted Execution Environment. InIEEE Global Communications Conference (GlobeCom), Kuala Lumpur, Malaysia

2023

[49] [49]

Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. 2015. Deep Learning Face Attributes in the Wild. InProceedings of International Conference on Computer Vision (ICCV)

2015

[50] [50]

Shayne Longpre, Robert Mahari, Anthony Chen, Naana Obeng-Marnu, Damien Sileo, William Brannon, Niklas Muennighoff, Nathan Khazam, Jad Kabbara, Kartik Perisetla, et al. 2024. A large-scale audit of dataset licensing and attribution in AI.Nature Machine Intelligence6, 8 (2024), 975–987

2024

[51] [51]

Shayne Longpre, Robert Mahari, Ariel Lee, Campbell Lund, Hamidah Oderinwale, William Brannon, Nayan Saxena, Naana Obeng-Marnu, Tobin South, Cole Hunter, Kevin Klyman, Christopher Klamm, Hailey Schoelkopf, Nikhil Singh, Manuel Cherep, Ahmad Mustafa Anis, An Dinh, Caroline Chitongo, Da Yin, Damien Sileo, Deividas Mataciunas, Diganta Misra, Emad Alghamdi, En...

2024

[52] [52]

Nils Lukas, Edward Jiang, Xinda Li, and Florian Kerschbaum. 2022. SoK: How Robust is Image Classification Deep Neural Network Watermarking?. In43rd IEEE Symposium on Security and Privacy, SP 2022. IEEE, 787–804

2022

[53] [53]

Pratyush Maini, Hengrui Jia, Nicolas Papernot, and Adam Dziedzic. 2024. LLM Dataset Inference: Did you train on my dataset?. InThe Thirty-eighth Annual Conference on Neural Information Processing Systems

2024

[54] [54]

Pratyush Maini, Mohammad Yaghini, and Nicolas Papernot. 2021. Dataset Infer- ence: Ownership Resolution in Machine Learning. InInternational Conference on Learning Representations

2021

[55] [55]

Bertin Martens. 2018. The importance of data access regimes for artificial in- telligence and machine learning.JRC Digital Economy Working Paper 2018-09 (2018)

2018

[56] [56]

Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and An- drew Y Ng. 2011. Reading digits in natural images with unsupervised feature learning.NIPS Workshop on Deep Learning and Unsupervised Feature Learning (2011)

2011

[57] [57]

Google Play. 2024. Intellectual Property Policy. https://support.google.com/ googleplay/android-developer/answer/9888072

arXiv 2024

[58] [58]

Girshick, Kaiming He, and Piotr Dollár

Ilija Radosavovic, Raj Prateek Kosaraju, Ross B. Girshick, Kaiming He, and Piotr Dollár. 2020. Designing Network Design Spaces. InProc. of IEEE/CVF CVPR

2020

[59] [59]

Arpit Rajauria. 2020. pegasus paraphrase. https://huggingface.co/tuner007/ pegasus_paraphrase

2020

[60] [60]

Alexandre Sablayrolles, Matthijs Douze, Cordelia Schmid, and Hervé Jégou. 2020. Radioactive data: tracing through training. InProceedings of the 37th International Conference on Machine Learning, ICML 2020 (Proceedings of Machine Learning Research, Vol. 119). PMLR, 8326–8335

2020

[61] [61]

Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen

Mark Sandler, Andrew G. Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. 2018. MobileNetV2: Inverted Residuals and Linear Bottle- necks. In2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR

2018

[62] [62]

Computer Vision Foundation / IEEE Computer Society, 4510–4520

[63] [63]

Saxe, Pang Wei Koh, Zhenghao Chen, Maneesh Bhand, Bipin Suresh, and Andrew Y

Andrew M. Saxe, Pang Wei Koh, Zhenghao Chen, Maneesh Bhand, Bipin Suresh, and Andrew Y. Ng. 2011. On Random Weights and Unsupervised Feature Learn- ing. InProceedings of the 28th International Conference on Machine Learning, ICML 2011. Omnipress, 1089–1096

2011

[64] [64]

Ilia Shumailov, Zakhar Shumaylov, Yiren Zhao, Nicolas Papernot, Ross Anderson, and Yarin Gal. 2024. AI models collapse when trained on recursively generated data.Nature631, 8022 (2024), 755–759

2024

[65] [65]

Karen Simonyan and Andrew Zisserman. 2014. Very deep convolutional net- works for large-scale image recognition.arXiv preprint arXiv:1409.1556(2014)

Pith/arXiv arXiv 2014

[66] [66]

Sriperumbudur, Arthur Gretton, Kenji Fukumizu, Bernhard Schölkopf, and Gert R

Bharath K. Sriperumbudur, Arthur Gretton, Kenji Fukumizu, Bernhard Schölkopf, and Gert R. G. Lanckriet. 2010. Hilbert Space Embeddings and Metrics on Probability Measures.J. Mach. Learn. Res.11 (2010), 1517–1561

2010

[67] [67]

Mingxing Tan and Quoc V. Le. 2021. EfficientNetV2: Smaller Models and Faster Training. InProceedings of the 38th International Conference on Machine Learning, ICML 2021 (Proceedings of Machine Learning Research, Vol. 139). PMLR, 10096– 10106

2021

[68] [68]

Iulia Turc, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. Well- read students learn better: The impact of student initialization on knowledge distillation.arXiv preprint arXiv:1908.08962(2019)

arXiv 2019

[69] [69]

Hadas Unger, CHERGUELAINE Ayoub, and BOUBEKRI Faycal. 2023. CNN News Articles from 2011 to 2022. https://huggingface.co/datasets/AyoubChLin/CNN_ News_Articles_2011-2022

2023

[70] [70]

James Vincent. 2023. Meta’s powerful AI language model has leaked online. https://www.theverge.com/2023/3/8/23629362/meta-ai-language- modelllama-leak-online-misuse

2023

[71] [71]

Maxim Kuznetsov Vladimir Vorobev. 2023. A paraphrasing model based on ChatGPT paraphrases. https://huggingface.co/humarin/chatgpt_paraphraser_ on_T5_base

2023

[72] [72]

Shuo Wang, Sharif Abuadbba, Sidharth Agarwal, Kristen Moore, Ruoxi Sun, Minhui Xue, Surya Nepal, Seyit Camtepe, and Salil Kanhere. 2022. PublicCheck: Public Integrity Verification for Services of Run-time Deep Models. In2023 IEEE Symposium on Security and Privacy (SP). IEEE Computer Society, 1239–1256

2022

[73] [73]

Yifei Wang, Jizhe Zhang, and Yisen Wang. 2024. Do Generated Data Always Help Contrastive Learning?. InThe Twelfth International Conference on Learning Representations

2024

[74] [74]

Yifan Yan, Xudong Pan, Mi Zhang, and Min Yang. 2023. Rethinking White-Box Watermarks on Deep Learning Models under Neural Structural Obfuscation. In 32nd USENIX Security Symposium, USENIX Security 2023. USENIX Association, 2347–2364. 15 Tian Dong∗, Yan Meng∗, Shaofeng Li †, Guoxing Chen ∗, Yuling Chen‡, Zhen Liu ∗, Haojin Zhu ∗, and Hao Chen §

2023

[75] [75]

Ze Yang, Can Xu, Wei Wu, and Zhoujun Li. 2019. Read, Attend and Comment: A Deep Architecture for Automatic News Comment Generation. InProceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, EMNLP-IJCNLP

2019

[76] [76]

Association for Computational Linguistics, 5076–5088

[77] [77]

Yuxia Zhan, Yan Meng, Yichang Xiong Lu Zhou, Xiaokuan Zhang, Lichuan Ma, Guoxing Chen, Qingqi Pei, and Haojin Zhu. 2024. VPVet: Vetting Privacy Policies of Virtual Reality Apps. InProceedings of the 2024 ACM SIGSAC Conference on Computer and Communications Security, 2024. ACM

2024

[78] [78]

Jingqing Zhang, Yao Zhao, Mohammad Saleh, and Peter J. Liu. 2020. PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization. In Proceedings of the 37th International Conference on Machine Learning, ICML 2020 (Proceedings of Machine Learning Research, Vol. 119). PMLR, 11328–11339

2020

[79] [79]

Efros, Eli Shechtman, and Oliver Wang

Richard Zhang, Phillip Isola, Alexei A. Efros, Eli Shechtman, and Oliver Wang

[80] [80]

In2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018

The Unreasonable Effectiveness of Deep Features as a Perceptual Metric. In2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018. Computer Vision Foundation / IEEE Computer Society, 586–595

2018