arXiv: 2310.16789 · v3 · pith:AVYX74M7new · submitted 2023-10-25 · 💻 cs.CL · cs.CR · cs.LG

Detecting Pretraining Data from Large Language Models

Weijia Shi, Anirudh Ajith, Mengzhou Xia, Yangsibo Huang, Daogao Liu, Terra Blevins, Danqi Chen, Luke Zettlemoyer

Pith reviewed 2026-05-17 18:01 UTC · model grok-4.3

classification 💻 cs.CL · cs.CR · cs.LG

keywords pretraining data detection · membership inference · large language models · Min-K% Prob · WIKIMIA benchmark · copyright detection · machine unlearning · data contamination

The pith

Min-K% Prob detects whether text was in an LLM's pretraining data by averaging the log probabilities of its lowest-probability tokens.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tackles the problem of identifying whether a given text appeared in an LLM's training data when only black-box access to the model is available. It creates the WIKIMIA benchmark, which splits data into text created before and after model training to provide reliable ground truth. The proposed Min-K% Prob method rests on the observation that unseen text tends to contain a small number of tokens to which the model assigns unusually low probability. This approach requires no reference model or knowledge of the training corpus, and the paper reports a 7.4% improvement over previous methods on the benchmark. The method is then shown to work in practical settings such as spotting copyrighted books and detecting contaminated downstream examples.

Core claim

Min-K% Prob works by selecting the K percent of tokens in an input that receive the smallest log probabilities under the target LLM and averaging those values; higher (less negative) scores indicate the text is more likely to have been seen during pretraining, because seen text should contain fewer low-probability outliers. The paper shows this simple statistic outperforms prior reference-model methods on the WIKIMIA benchmark and remains effective when applied to copyrighted-book detection, downstream contamination checks, and verification of machine-unlearning success.

What carries the argument

Min-K% Prob, a membership score computed from the average log probability of the K% lowest-probability tokens in the input sequence under the black-box LLM.
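
As a rough illustration of the score described here, the sketch below averages the lowest K% of per-token log probabilities. It assumes the log probabilities have already been obtained from the target model; the function name `min_k_prob`, the default `k=0.2`, the rule of keeping at least one token, and the example values are all assumptions made for this sketch, not details taken from the paper.

```python
import math

def min_k_prob(token_logprobs, k=0.2):
    """Average the lowest k fraction of per-token log probabilities."""
    if not token_logprobs:
        raise ValueError("need at least one token log probability")
    if not 0.0 < k <= 1.0:
        raise ValueError("k must be in (0, 1]")
    # Assumption: keep at least one token so very short inputs still get a score.
    n_keep = max(1, math.floor(len(token_logprobs) * k))
    lowest = sorted(token_logprobs)[:n_keep]
    return sum(lowest) / n_keep

# Hypothetical log probabilities, for illustration only.
smooth = [-0.5, -0.7, -0.4, -0.6, -0.5, -0.8, -0.3, -0.6, -0.5, -0.4]
spiky = [-0.5, -0.7, -0.4, -9.0, -0.5, -0.8, -0.3, -7.5, -0.5, -0.4]

score_smooth = min_k_prob(smooth, k=0.2)  # no outliers: higher score
score_spiky = min_k_prob(spiky, k=0.2)    # two outliers: much lower score
print(score_smooth, score_spiky)
```

Under the paper's hypothesis, the sequence with outlier tokens would be the more likely non-member; turning scores into a decision still requires a threshold chosen on labeled data.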

If this is right

Copyright holders can scan published books against released LLMs to find unauthorized use.
Developers can check whether benchmark examples leaked into pretraining and inflated reported results.
Auditors can test whether machine-unlearning procedures actually removed specific private examples.
Detection works without any auxiliary training or access to the original pretraining corpus.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the method scales to frontier models, it could support regulatory requirements that companies disclose or remove specific data sources.
The same low-probability outlier idea might extend to detecting training data in image or audio models.
Repeated application across many queries could allow approximate reconstruction of which domains dominated an LLM's training mix.

Load-bearing premise

Unseen text is likely to contain a few outlier words that the model assigns very low probability, while text seen in training is less likely to have such outliers.

What would settle it

Train an LLM on a fully known corpus, hold out a test set of unseen text, and measure whether Min-K% Prob assigns reliably lower scores to the held-out texts than to the training texts.
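
One way to quantify "reliably lower" in such a test is the area under the ROC curve, computed directly as the fraction of (training, held-out) pairs in which the training text scores higher. The sketch below is a minimal pairwise implementation; the function name `auc` and the score lists are hypothetical and stand in for Min-K% Prob scores from a real experiment.

```python
def auc(member_scores, nonmember_scores):
    """Probability that a random member outscores a random non-member (ties count half)."""
    wins = 0.0
    for m in member_scores:
        for n in nonmember_scores:
            if m > n:
                wins += 1.0
            elif m == n:
                wins += 0.5
    return wins / (len(member_scores) * len(nonmember_scores))

# Hypothetical Min-K% Prob scores, for illustration only.
train_scores = [-1.2, -0.9, -1.5, -2.0]
heldout_scores = [-3.1, -1.4, -2.8, -4.0]

result = auc(train_scores, heldout_scores)
print(result)  # 0.5 would mean no separation; 1.0 perfect separation
```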

Original abstract

Although large language models (LLMs) are widely deployed, the data used to train them is rarely disclosed. Given the incredible scale of this data, up to trillions of tokens, it is all but certain that it includes potentially problematic text such as copyrighted materials, personally identifiable information, and test data for widely reported reference benchmarks. However, we currently have no way to know which data of these types is included or in what proportions. In this paper, we study the pretraining data detection problem: given a piece of text and black-box access to an LLM without knowing the pretraining data, can we determine if the model was trained on the provided text? To facilitate this study, we introduce a dynamic benchmark WIKIMIA that uses data created before and after model training to support gold truth detection. We also introduce a new detection method Min-K% Prob based on a simple hypothesis: an unseen example is likely to contain a few outlier words with low probabilities under the LLM, while a seen example is less likely to have words with such low probabilities. Min-K% Prob can be applied without any knowledge about the pretraining corpus or any additional training, departing from previous detection methods that require training a reference model on data that is similar to the pretraining data. Moreover, our experiments demonstrate that Min-K% Prob achieves a 7.4% improvement on WIKIMIA over these previous methods. We apply Min-K% Prob to three real-world scenarios, copyrighted book detection, contaminated downstream example detection and privacy auditing of machine unlearning, and find it a consistently effective solution.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Min-K% Prob gives a simple black-box detector that improves on prior methods, but WIKIMIA's temporal split risks conflating time-based distribution shift with actual membership.

The main thing here is that the paper introduces Min-K% Prob, a no-reference-model method that flags potential pretraining membership by averaging the log probabilities of the lowest-probability K% of tokens, and pairs it with WIKIMIA, a benchmark that treats pre-cutoff Wikipedia text as positive examples and post-cutoff text as negative. It claims a 7.4% gain over earlier detectors and shows usable results on copyrighted books, benchmark contamination, and unlearning audits. That is the practical contribution worth noting first.

The approach is genuinely lighter than prior work because it skips training any shadow model on proxy data and works from black-box access alone. The three applied scenarios give some indication that the signal is stable enough to try in practice without heavy setup. The core hypothesis is also stated plainly: unseen text should produce more extreme low-probability outliers than seen text.

The soft spot is the benchmark itself. WIKIMIA assigns labels solely by creation date relative to the training cutoff. That leaves open the possibility that post-cutoff articles differ in vocabulary, style, or topic in ways that systematically lower token probabilities for reasons unrelated to whether any specific sequence appeared in training. If the detector is mostly catching those distributional differences, the reported improvement does not cleanly demonstrate membership detection inside a fixed distribution. Additional controls that vary membership while holding time and style fixed would have strengthened the case.

This work is aimed at people who audit deployed LLMs for data leakage, copyright, or contamination. The method is straightforward to reimplement and the problem is timely, so the paper deserves a serious referee. I would send it to review with a request that reviewers check the benchmark for temporal confounds and test the detector on other held-out sets.

Referee Report

3 major / 2 minor

Summary. The paper introduces Min-K% Prob, a black-box method to detect whether a given text sequence was included in an LLM's pretraining data. The approach rests on the hypothesis that unseen texts are more likely to contain a small number of outlier tokens with unusually low model probabilities. To evaluate this, the authors propose the WIKIMIA benchmark, which labels Wikipedia articles published before a model's training cutoff as positive (seen) examples and post-cutoff articles as negative (unseen) examples. Experiments report a 7.4% improvement over prior reference-model baselines on WIKIMIA, and the method is applied to three practical tasks: copyrighted-book detection, downstream-data contamination checks, and auditing machine-unlearning privacy guarantees.

Significance. If the core hypothesis holds and the benchmark isolates membership rather than temporal shift, the work would supply a simple, training-free auditing tool useful for copyright, privacy, and contamination analyses. The dynamic nature of WIKIMIA is a practical contribution that can be updated as new models are released. The absence of any need for a reference model trained on similar data is a clear methodological advantage over earlier approaches.

major comments (3)

[§3] WIKIMIA construction: The benchmark labels post-cutoff Wikipedia articles as negative examples solely on the basis of publication date. This risks conflating pretraining membership with temporal distribution shift in topics, vocabulary, or writing style; the reported 7.4% gain could therefore reflect the model's general difficulty with newer text rather than absence from the training corpus. Additional controls (e.g., topic-matched pre/post pairs or perplexity-matched baselines) are needed to establish that the signal is membership-specific.
[Abstract and §4] Core hypothesis: The claim that 'an unseen example is likely to contain a few outlier words with low probabilities' is presented without direct ablation or statistical test isolating the contribution of the lowest-K% tokens versus overall perplexity or length. Because this hypothesis is load-bearing for both the method and the benchmark results, an explicit verification (e.g., comparing Min-K% against full-sequence perplexity or random-K% baselines) should be added.
[Results] Performance numbers: The 7.4% improvement on WIKIMIA is stated without reported standard deviations, number of runs, or statistical significance tests against the baselines. Given that the central empirical claim rests on this margin, confidence intervals or p-values must be supplied to support the superiority statement.
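
To make the last request concrete, a paired bootstrap over examples is one common way to attach an interval to the gap between two detectors' AUCs. The sketch below is an assumption about how such a check could look, not the authors' protocol; `bootstrap_auc_gap`, the resample count, and all score values are hypothetical.

```python
import random

def auc(pos, neg):
    """Chance that a member score beats a non-member score; ties count half."""
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def bootstrap_auc_gap(pos_a, neg_a, pos_b, neg_b, n_boot=1000, alpha=0.05, seed=0):
    """Percentile interval for AUC(A) - AUC(B) under paired resampling of examples."""
    if len(pos_a) != len(pos_b) or len(neg_a) != len(neg_b):
        raise ValueError("both detectors must score the same examples")
    rng = random.Random(seed)
    gaps = []
    for _ in range(n_boot):
        # Draw the same resampled examples for both detectors (paired bootstrap).
        pi = [rng.randrange(len(pos_a)) for _ in pos_a]
        ni = [rng.randrange(len(neg_a)) for _ in neg_a]
        auc_a = auc([pos_a[i] for i in pi], [neg_a[i] for i in ni])
        auc_b = auc([pos_b[i] for i in pi], [neg_b[i] for i in ni])
        gaps.append(auc_a - auc_b)
    gaps.sort()
    lo = gaps[int(alpha / 2 * n_boot)]
    hi = gaps[max(0, int((1 - alpha / 2) * n_boot) - 1)]
    return lo, hi

# Hypothetical scores for the same 6 members and 6 non-members under two detectors.
members_a = [-1.1, -0.9, -1.4, -1.0, -2.2, -1.3]
nonmembers_a = [-2.5, -1.2, -3.0, -2.1, -1.8, -2.9]
members_b = [-1.0, -1.6, -1.2, -2.0, -1.5, -1.1]
nonmembers_b = [-1.4, -1.3, -2.2, -1.0, -1.9, -1.7]

ci_low, ci_high = bootstrap_auc_gap(members_a, nonmembers_a, members_b, nonmembers_b)
print(f"bootstrap interval for the AUC gap: [{ci_low:.3f}, {ci_high:.3f}]")
```

An interval that excludes zero would support a superiority claim on that sample; with only a handful of examples, as here, the interval is expected to be wide.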

minor comments (2)

[Methods] Provide an explicit mathematical definition or pseudocode for Min-K% Prob (including how ties and tokenization edge cases are handled) in the methods section.
[Figures] Figure legends and axis labels in the experimental plots should be expanded to make the comparison between Min-K% and prior methods immediately readable without reference to the caption.
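
For reference, the definition requested in the first minor comment can be written as below, following the verbal description of the method. Here x = x_1, ..., x_N is the tokenized input, Min-K%(x) is the set of the K% of tokens with the lowest conditional probability, and E is the size of that set. How ties and rounding of E are handled is not specified in this summary, so any concrete rule (for example E = max(1, ⌊N·K/100⌋)) should be read as an assumption.

```latex
\[
\text{Min-K\% Prob}(x) \;=\; \frac{1}{E} \sum_{x_i \in \text{Min-K\%}(x)} \log p\left(x_i \mid x_1, \ldots, x_{i-1}\right),
\qquad E = \left|\text{Min-K\%}(x)\right|
\]
```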

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and will revise the manuscript to incorporate the suggested improvements where feasible.

Referee: [§3] WIKIMIA construction: The benchmark labels post-cutoff Wikipedia articles as negative examples solely on the basis of publication date. This risks conflating pretraining membership with temporal distribution shift in topics, vocabulary, or writing style; the reported 7.4% gain could therefore reflect the model's general difficulty with newer text rather than absence from the training corpus. Additional controls (e.g., topic-matched pre/post pairs or perplexity-matched baselines) are needed to establish that the signal is membership-specific.

Authors: We acknowledge that temporal distribution shift is a valid concern and could partially explain performance differences. To strengthen the benchmark, we will add controls using topic-matched pre- and post-cutoff Wikipedia article pairs in a revised §3. We will also report results against perplexity-matched baselines to help isolate membership effects from general difficulty with newer text. These additions will clarify the extent to which the signal is membership-specific while preserving the dynamic and practical nature of WIKIMIA.

revision: partial

Referee: [Abstract and §4] Core hypothesis: The claim that 'an unseen example is likely to contain a few outlier words with low probabilities' is presented without direct ablation or statistical test isolating the contribution of the lowest-K% tokens versus overall perplexity or length. Because this hypothesis is load-bearing for both the method and the benchmark results, an explicit verification (e.g., comparing Min-K% against full-sequence perplexity or random-K% baselines) should be added.

Authors: We agree that an explicit ablation would provide stronger support for the core hypothesis. In the revised manuscript we will add a dedicated ablation subsection in §4 that directly compares Min-K% Prob against full-sequence perplexity and random-K% token selection baselines, including statistical tests to quantify the contribution of focusing on the lowest-probability tokens.

revision: yes

Referee: [Results] Performance numbers: The 7.4% improvement on WIKIMIA is stated without reported standard deviations, number of runs, or statistical significance tests against the baselines. Given that the central empirical claim rests on this margin, confidence intervals or p-values must be supplied to support the superiority statement.

Authors: We thank the referee for highlighting this omission. We will revise the results section to report standard deviations across multiple runs, explicitly state the number of runs performed, and include statistical significance tests (p-values) comparing Min-K% Prob to the baselines to substantiate the reported improvement.

revision: yes
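
The ablation discussed in the second response compares three ways of scoring the same token log probabilities. The sketch below puts them side by side on two hypothetical sequences that share the same mean log probability but differ in whether they contain an outlier. The function names, the helper `n_keep`, and the example values are assumptions for illustration, not the authors' experimental setup.

```python
import math
import random

def n_keep(n_tokens, k):
    """Number of tokens selected for a fraction k, never fewer than one."""
    return max(1, math.floor(n_tokens * k))

def mean_logprob(logprobs):
    """Full-sequence baseline; monotone in perplexity, since perplexity = exp(-mean)."""
    return sum(logprobs) / len(logprobs)

def min_k_prob(logprobs, k):
    """Average of the lowest k fraction of token log probabilities."""
    lowest = sorted(logprobs)[:n_keep(len(logprobs), k)]
    return sum(lowest) / len(lowest)

def random_k_prob(logprobs, k, rng):
    """Control: average of a uniformly random k fraction of tokens."""
    picked = rng.sample(logprobs, n_keep(len(logprobs), k))
    return sum(picked) / len(picked)

# Hypothetical sequences with (nearly) identical mean log probability.
flat = [-1.0] * 10            # no outliers
spiky = [-0.2] * 9 + [-8.2]   # one strong outlier

rng = random.Random(0)
for name, seq in [("flat", flat), ("spiky", spiky)]:
    print(name,
          round(mean_logprob(seq), 3),
          round(min_k_prob(seq, 0.1), 3),
          round(random_k_prob(seq, 0.1, rng), 3))
```

In this toy case the full-sequence mean cannot tell the two sequences apart while the lowest-K% average can, which is the kind of contrast the requested ablation would need to demonstrate on real data.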

Circularity Check

0 steps flagged

Min-K% Prob is a direct heuristic on target LLM probabilities, validated on external temporal benchmark

The paper defines Min-K% Prob explicitly as the mean of the lowest K% token log-probabilities produced by the queried LLM itself, under the hypothesis that unseen text contains more low-probability outliers. This quantity is computed once per example and compared against a threshold; no parameter is fitted to the target detection labels, and no prior result from the same authors is invoked to force uniqueness or to smuggle an ansatz. The WIKIMIA benchmark supplies gold labels via an independent temporal cutoff that does not reference the Min-K% statistic, so the reported improvement is an empirical measurement rather than a definitional identity. No step in the derivation reduces the claimed detector to its own inputs by construction.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on one domain assumption about probability outliers and introduces a tunable percentage parameter whose exact value is not fixed by theory.

free parameters (1)

K in Min-K%
The percentage threshold for selecting the lowest-probability tokens is a tunable hyperparameter whose specific value affects detection performance but is not derived from first principles.
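
Because K is tuned rather than derived, a simple sweep on labeled validation data is one way to see how sensitive detection is to it. The sketch below computes a pairwise AUC for several values of K on toy data built so that small K separates the classes and K = 100% does not; `sweep_k`, the candidate values, and the data are hypothetical.

```python
import math

def min_k_prob(logprobs, k):
    """Average of the lowest k fraction of token log probabilities."""
    n = max(1, math.floor(len(logprobs) * k))
    lowest = sorted(logprobs)[:n]
    return sum(lowest) / n

def auc(pos, neg):
    """Chance that a member score beats a non-member score; ties count half."""
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def sweep_k(members, nonmembers, ks):
    """Return {k: AUC} when sequences are scored with Min-K% Prob at each k."""
    return {
        k: auc([min_k_prob(s, k) for s in members],
               [min_k_prob(s, k) for s in nonmembers])
        for k in ks
    }

# Hypothetical validation data: members are flat, non-members have one outlier
# but a higher average log probability overall.
members = [[-1.0] * 10, [-1.0] * 10]
nonmembers = [[-0.1] * 9 + [-5.0], [-0.1] * 9 + [-5.0]]

results = sweep_k(members, nonmembers, ks=[0.1, 0.5, 1.0])
for k, value in results.items():
    print(f"K = {int(k * 100)}%: AUC = {value:.2f}")
```

Any K chosen this way should be fixed on validation data before it is applied to the texts being audited.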

axioms (1)

domain assumption: An unseen example is likely to contain a few outlier words with low probabilities under the LLM, while a seen example is less likely to have words with such low probabilities.
This hypothesis is explicitly stated as the basis for the Min-K% Prob method in the abstract.

Forward citations

Cited by 19 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Privacy Auditing with Zero (0) Training Run
cs.CR · 2026-05 · unverdicted · novelty 8.0
Zero-Run auditing supplies valid lower bounds on differential privacy parameters from fixed member and non-member datasets by modeling and correcting distribution-shift confounding via causal-inference techniques.

Pretraining Exposure Explains Popularity Judgments in Large Language Models
cs.CL · 2026-05 · unverdicted · novelty 8.0
LLM popularity judgments align more closely with pretraining data exposure counts than with Wikipedia popularity, with stronger effects in pairwise comparisons and larger models.

Learning the Signature of Memorization in Autoregressive Language Models
cs.CL · 2026-04 · accept · novelty 8.0
A classifier trained only on transformer fine-tuning data detects an invariant memorization signature that transfers to Mamba, RWKV-4, and RecurrentGemma with AUCs of 0.963, 0.972, and 0.936.

Negative Preference Optimization: From Catastrophic Collapse to Effective Unlearning
cs.LG · 2024-04 · conditional · novelty 8.0
NPO enables stable unlearning of 50%+ training data in LLMs on TOFU by making collapse exponentially slower than gradient ascent, preserving sensible outputs where prior methods fail.

DistractMIA: Black-Box Membership Inference on Vision-Language Models via Semantic Distraction
cs.CV · 2026-05 · unverdicted · novelty 7.0
DistractMIA performs output-only black-box membership inference on vision-language models by inserting semantic distractors and measuring shifts in generated text responses.

Dataset Watermarking for Closed LLMs with Provable Detection
cs.LG · 2026-05 · unverdicted · novelty 7.0
A new watermarking method for closed LLMs boosts random word-pair co-occurrences via rephrasing and detects the signal statistically in outputs, working reliably even when the watermarked data is only 1% of fine-tunin...

A Systematic Survey of Security Threats and Defenses in LLM-Based AI Agents: A Layered Attack Surface Framework
cs.CR · 2026-04 · unverdicted · novelty 7.0
A new 7x4 taxonomy organizes agentic AI security threats by architectural layer and persistence timescale, revealing under-explored upper layers and missing defenses after surveying 116 papers.

How Independent are Large Language Models? A Statistical Framework for Auditing Behavioral Entanglement and Reweighting Verifier Ensembles
cs.AI · 2026-04 · unverdicted · novelty 7.0
A new auditing framework reveals widespread behavioral entanglement among LLMs and shows that reweighting ensembles based on measured independence improves verification accuracy by up to 4.5%.

Unlearning What Matters: Token-Level Attribution for Precise Language Model Unlearning
cs.CL · 2026-05 · unverdicted · novelty 6.0
TokenUnlearn identifies critical tokens via masking and entropy signals then applies hard selection or soft weighting to unlearn only those tokens, yielding better forgetting and retained utility than sequence-level b...

Adaptive Defense Orchestration for RAG: A Sentinel-Strategist Architecture against Multi-Vector Attacks
cs.CR · 2026-04 · unverdicted · novelty 6.0
A context-aware Sentinel-Strategist system for RAG selectively applies defenses to block membership inference and data poisoning while recovering most retrieval utility compared to always-on defense stacks.

SPENCE: A Syntactic Probe for Detecting Contamination in NL2SQL Benchmarks
cs.CL · 2026-04 · unverdicted · novelty 6.0
SPENCE shows older NL2SQL benchmarks like Spider have high performance sensitivity to syntactic changes, indicating likely training contamination, while newer ones like BIRD show little sensitivity and appear largely clean.

Representation-Guided Parameter-Efficient LLM Unlearning
cs.CL · 2026-04 · unverdicted · novelty 6.0
REGLU guides LoRA-based unlearning via representation subspaces and orthogonal regularization to outperform prior methods on forget-retain trade-off in LLM benchmarks.

Filling the Gaps: Selective Knowledge Augmentation for LLM Recommenders
cs.IR · 2026-04 · unverdicted · novelty 6.0
KnowSA_CKP uses comparative knowledge probing to selectively augment LLM prompts for items with knowledge gaps, improving recommendation accuracy and context efficiency.

Auditing Data Membership in Reinforcement Learning With Verifiable Rewards
cs.CR · 2025-11 · unverdicted · novelty 6.0
DIBA detects membership of prompts in RLVR training by measuring reward success changes and policy behavioral drift between pre- and post-RLVR model checkpoints.

Constitutional Classifiers: Defending against Universal Jailbreaks across Thousands of Hours of Red Teaming
cs.CL · 2025-01 · conditional · novelty 6.0
Constitutional Classifiers trained on synthetic data from natural language constitutions defend LLMs against universal jailbreaks, with no successful bypass found in over 3000 hours of red teaming and only minor deplo...

DataComp-LM: In search of the next generation of training sets for language models
cs.LG · 2024-06 · unverdicted · novelty 6.0
DCLM-Baseline dataset lets a 7B model reach 64% 5-shot MMLU accuracy after 2.6T tokens, beating prior open-data models by 6.6 points on MMLU with 40% less compute.

LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code
cs.SE · 2024-03 · unverdicted · novelty 6.0
LiveCodeBench collects 400 recent contest problems to create a contamination-free benchmark evaluating LLMs on code generation and related capabilities like self-repair and execution.

TOFU: A Task of Fictitious Unlearning for LLMs
cs.LG · 2024-01 · conditional · novelty 6.0
TOFU is a new benchmark with synthetic profiles and metrics demonstrating that existing unlearning algorithms for LLMs fail to achieve effective forgetting of targeted information.

Stable-GFlowNet: Toward Diverse and Robust LLM Red-Teaming via Contrastive Trajectory Balance
cs.LG · 2026-05 · unverdicted · novelty 5.0
Stable-GFlowNet improves training stability and attack diversity in LLM red-teaming by eliminating Z estimation via contrastive trajectory balance while preserving GFN optimality.

Reference graph

Works this paper leans on

130 extracted references · 130 canonical work pages · cited by 19 Pith papers · 7 internal anchors

[1]

Stability of stochastic gradient descent on nonsmooth convex losses

Raef Bassily, Vitaly Feldman, Crist \'o bal Guzm \'a n, and Kunal Talwar. Stability of stochastic gradient descent on nonsmooth convex losses. Advances in Neural Information Processing Systems, 33: 0 4381--4391, 2020

work page 2020
[2]

Pythia: A suite for analyzing large language models across training and scaling, 2023

Stella Biderman, Hailey Schoelkopf, Quentin Anthony, Herbie Bradley, Kyle O'Brien, Eric Hallahan, Mohammad Aflah Khan, Shivanshu Purohit, USVSN Sai Prashanth, Edward Raff, Aviya Skowron, Lintang Sutawika, and Oskar van der Wal. Pythia: A suite for analyzing large language models across training and scaling, 2023

work page 2023
[3]

GPT-NeoX-20B: An Open-Source Autoregressive Language Model

Sid Black, Stella Biderman, Eric Hallahan, Quentin Anthony, Leo Gao, Laurence Golding, Horace He, Connor Leahy, Kyle McDonell, Jason Phang, Michael Pieler, USVSN Sai Prashanth, Shivanshu Purohit, Laria Reynolds, Jonathan Tow, Ben Wang, and Samuel Weinbach. GPT-NeoX-20B : An open-source autoregressive language model. In Proceedings of the ACL Workshop on C...

work page internal anchor Pith review Pith/arXiv arXiv 2022
[4]

Machine unlearning

Lucas Bourtoule, Varun Chandrasekaran, Christopher A Choquette-Choo, Hengrui Jia, Adelin Travers, Baiwu Zhang, David Lie, and Nicolas Papernot. Machine unlearning. In 2021 IEEE Symposium on Security and Privacy (SP), pp.\ 141--159. IEEE, 2021

work page 2021
[5]

Language models are few-shot learners

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gr...

work page 1901
[6]

Language models are few-shot learners

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33: 0 1877--1901, 2020 b

work page 1901
[7]

Extracting training data from large language models

Nicholas Carlini, Florian Tramer, Eric Wallace, Matthew Jagielski, Ariel Herbert-Voss, Katherine Lee, Adam Roberts, Tom Brown, Dawn Song, Ulfar Erlingsson, et al. Extracting training data from large language models. In 30th USENIX Security Symposium (USENIX Security 21), pp.\ 2633--2650, 2021

work page 2021
[8]

Membership inference attacks from first principles

Nicholas Carlini, Steve Chien, Milad Nasr, Shuang Song, Andreas Terzis, and Florian Tramer. Membership inference attacks from first principles. In 2022 IEEE Symposium on Security and Privacy (SP), pp.\ 1897--1914. IEEE, 2022

work page 2022
[11]

Boolq: Exploring the surprising difficulty of natural yes/no questions

Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. Boolq: Exploring the surprising difficulty of natural yes/no questions. In NAACL, 2019

work page 2019
[13]

Glam: Efficient scaling of language models with mixture-of-experts

Nan Du, Yanping Huang, Andrew M Dai, Simon Tong, Dmitry Lepikhin, Yuanzhong Xu, Maxim Krikun, Yanqi Zhou, Adams Wei Yu, Orhan Firat, et al. Glam: Efficient scaling of language models with mixture-of-experts. In International Conference on Machine Learning, pp.\ 5547--5569. PMLR, 2022

work page 2022
[14]

Who’s harry potter? approximate unlearning in llms.arXiv preprint arXiv:2310.02238,

Ronen Eldan and Mark Russinovich. Who's Harry Potter ? approximate unlearning in LLMs . arXiv preprint arXiv:2310.02238, 2023

work page arXiv 2023
[15]

Does learning require memorization? a short tale about a long tail

Vitaly Feldman. Does learning require memorization? a short tale about a long tail. In Proceedings of the 52nd Annual ACM SIGACT Symposium on Theory of Computing, pp.\ 954--959, 2020

work page 2020
[17]

SimCSE : Simple contrastive learning of sentence embeddings

Tianyu Gao, Xingcheng Yao, and Danqi Chen. SimCSE : Simple contrastive learning of sentence embeddings. In Empirical Methods in Natural Language Processing (EMNLP), 2021

work page 2021
[18]

Making ai forget you: Data deletion in machine learning

Antonio Ginart, Melody Guan, Gregory Valiant, and James Y Zou. Making ai forget you: Data deletion in machine learning. Advances in neural information processing systems, 32, 2019

work page 2019
[20]

Recovering private text in federated learning of language models

Samyak Gupta, Yangsibo Huang, Zexuan Zhong, Tianyu Gao, Kai Li, and Danqi Chen. Recovering private text in federated learning of language models. Advances in Neural Information Processing Systems, 35: 0 8130--8143, 2022

work page 2022
[21]

Adaptive machine unlearning

Varun Gupta, Christopher Jung, Seth Neel, Aaron Roth, Saeed Sharifi-Malvajerdi, and Chris Waites. Adaptive machine unlearning. Advances in Neural Information Processing Systems, 34: 0 16319--16330, 2021

work page 2021
[22]

Train faster, generalize better: Stability of stochastic gradient descent

Moritz Hardt, Ben Recht, and Yoram Singer. Train faster, generalize better: Stability of stochastic gradient descent. In International conference on machine learning, pp.\ 1225--1234. PMLR, 2016

work page 2016
[23]

A dataset auditing method for collaboratively trained machine learning models

Yangsibo Huang, Chun-Yin Huang, Xiaoxiao Li, and Kai Li. A dataset auditing method for collaboratively trained machine learning models. IEEE Transactions on Medical Imaging, 2022

work page 2022
[24]

Approximate data deletion from machine learning models

Zachary Izzo, Mary Anne Smart, Kamalika Chaudhuri, and James Zou. Approximate data deletion from machine learning models. In International Conference on Artificial Intelligence and Statistics, pp.\ 2008--2016. PMLR, 2021

work page 2008
[26]

Auditing differentially private machine learning: How private is private sgd? Advances in Neural Information Processing Systems, 33: 0 22205--22216, 2020

Matthew Jagielski, Jonathan Ullman, and Alina Oprea. Auditing differentially private machine learning: How private is private sgd? Advances in Neural Information Processing Systems, 33: 0 22205--22216, 2020

work page 2020
[27]

Evaluating differentially private machine learning in practice

Bargav Jayaraman and David Evans. Evaluating differentially private machine learning in practice. In 28th USENIX Security Symposium (USENIX Security 19), pp.\ 1895--1912, 2019

work page 1912
[28]

Deduplicating training data mitigates privacy risks in language models

Nikhil Kandpal, Eric Wallace, and Colin Raffel. Deduplicating training data mitigates privacy risks in language models. In International Conference on Machine Learning, pp.\ 10697--10707. PMLR, 2022

work page 2022
[29]

California consumer privacy act, 2018

California State Legislature. California consumer privacy act, 2018. URL https://oag.ca.gov/privacy/ccpa

work page 2018
[30]

Stolen memories: Leveraging model memorization for calibrated \ White-Box \ membership inference

Klas Leino and Matt Fredrikson. Stolen memories: Leveraging model memorization for calibrated \ White-Box \ membership inference. In 29th USENIX security symposium (USENIX Security 20), pp.\ 1605--1622, 2020

work page 2020
[31]

Rouge: A package for automatic evaluation of summaries

Chin-Yew Lin. Rouge: A package for automatic evaluation of summaries. In Text summarization branches out, pp.\ 74--81, 2004

work page 2004
[32]

Truthfulqa: Measuring how models mimic human falsehoods, 2021

Stephanie Lin, Jacob Hilton, and Owain Evans. Truthfulqa: Measuring how models mimic human falsehoods, 2021

work page 2021
[35]

Maas, Raymond E

Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pp.\ 142--150, Portland, Oregon, USA, June 2011. Association for Computational Linguistics. UR...

work page 2011
[36]

Data contamination: From memorization to exploitation

Inbal Magar and Roy Schwartz. Data contamination: From memorization to exploitation. ArXiv, abs/2203.08242, 2022. URL https://api.semanticscholar.org/CorpusID:247475929

work page arXiv 2022
[38]

Data portraits: Recording foundation model training data, 2023

Marc Marone and Benjamin Van Durme . Data portraits: Recording foundation model training data, 2023. URL https://arxiv.org/abs/2303.03919

work page arXiv 2023
[43]

Manning, and Chelsea Finn

Eric Mitchell, Yoonho Lee, Alexander Khazatsky, Christopher D. Manning, and Chelsea Finn. Detectgpt: Zero-shot machine-generated text detection using probability curvature, 2023. URL https://arxiv.org/abs/2301.11305

work page arXiv 2023
[44]

Maximilian Mozes, Xuanli He, Bennett Kleinberg, and Lewis D. Griffin. Use of llms for illicit purposes: Threats, prevention measures, and vulnerabilities, 2023

work page 2023
[45]

Gpt-4 and professional benchmarks: the wrong answer to the wrong question, 2023

Arvind Narayanan. Gpt-4 and professional benchmarks: the wrong answer to the wrong question, 2023. URL https://www.aisnakeoil.com/p/gpt-4-and-professional-benchmarks

work page 2023
[46]

Adversary instantiation: Lower bounds for differentially private machine learning

Milad Nasr, Shuang Songi, Abhradeep Thakurta, Nicolas Papernot, and Nicholas Carlin. Adversary instantiation: Lower bounds for differentially private machine learning. In 2021 IEEE Symposium on security and privacy (SP), pp.\ 866--882. IEEE, 2021

work page 2021
[48]

Gpt-4 technical report, 2023

OpenAI. Gpt-4 technical report, 2023

work page 2023
[49]

Did chatgpt cheat on your test?, 2023

Oscar Sainz, Jon Ander Campos, Iker García-Ferrero, Julen Etxaniz, and Eneko Agirre. Did chatgpt cheat on your test?, 2023. URL https://hitz-zentroa.github.io/lm-contamination/blog/

work page 2023
[50]

Remember what you want to forget: Algorithms for machine unlearning

Ayush Sekhari, Jayadev Acharya, Gautam Kamath, and Ananda Theertha Suresh. Remember what you want to forget: Algorithms for machine unlearning. Advances in Neural Information Processing Systems, 34:18075--18086, 2021

[51]

Virat Shejwalkar, Huseyin A Inan, Amir Houmansadr, and Robert Sim. Membership inference attacks against NLP classification models. In NeurIPS 2021 Workshop Privacy in Machine Learning, 2021. URL https://openreview.net/forum?id=74lwg5oxheC

[52]

R. Shokri, Marco Stronati, Congzheng Song, and Vitaly Shmatikov. Membership inference attacks against machine learning models. In 2017 IEEE Symposium on Security and Privacy (SP), pp. 3--18, 2016

[53]

Reza Shokri, Marco Stronati, Congzheng Song, and Vitaly Shmatikov. Membership inference attacks against machine learning models. In 2017 IEEE Symposium on Security and Privacy (SP), pp. 3--18. IEEE, 2017

[54]

Congzheng Song and Vitaly Shmatikov. Auditing data provenance in text-generation models. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 196--206, 2019

[57]

TogetherCompute. Redpajama: An open source recipe to reproduce llama training dataset, 2023. URL https://github.com/togethercomputer/RedPajama-Data

[59]

Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Harts...

[60]

Paul Voigt and Axel Von dem Bussche. The EU general data protection regulation (GDPR). A Practical Guide, 1st Ed., Cham: Springer International Publishing, 10(3152676):10--5555, 2017

[61]

Lauren Watson, Chuan Guo, Graham Cormode, and Alexandre Sablayrolles. On the importance of difficulty calibration in membership inference attacks. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=3eIrli0TwQ

[62]

Jason Wei, Maarten Bosma, Vincent Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M. Dai, and Quoc V Le. Finetuned language models are zero-shot learners. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=gEZrGCozdqR

[63]

Orion Weller, Marc Marone, Nathaniel Weir, Dawn Lawrie, Daniel Khashabi, and Benjamin Van Durme. "according to ..." prompting language models improves quoting from pre-training data, 2023

[64]

Yinjun Wu, Edgar Dobriban, and Susan Davidson. DeltaGrad: Rapid retraining of machine learning models. In International Conference on Machine Learning, pp. 10355--10366. PMLR, 2020

[65]

Jingwen Ye, Yifang Fu, Jie Song, Xingyi Yang, Songhua Liu, Xin Jin, Mingli Song, and Xinchao Wang. Learning with recoverable forgetting. In European Conference on Computer Vision, pp. 87--103. Springer, 2022

[66]

Samuel Yeom, Irene Giacomelli, Matt Fredrikson, and Somesh Jha. Privacy risk in machine learning: Analyzing the connection to overfitting. In 2018 IEEE 31st Computer Security Foundations Symposium (CSF), pp. 268--282, 2018a. doi:10.1109/CSF.2018.00027

[67]

Samuel Yeom, Irene Giacomelli, Matt Fredrikson, and Somesh Jha. Privacy risk in machine learning: Analyzing the connection to overfitting. In 2018 IEEE 31st Computer Security Foundations Symposium (CSF), pp. 268--282. IEEE, 2018b

[68]

Santiago Zanella-Béguelin, Lukas Wutschitz, Shruti Tople, Victor Rühle, Andrew Paverd, Olga Ohrimenko, Boris Köpf, and Marc Brockschmidt. Analyzing information leakage of updates to natural language models. In Proceedings of the 2020 ACM SIGSAC Conference on Computer and Communications Security, pp. 363--375, 2020

[71]

Train faster, generalize better: Stability of stochastic gradient descent. In International Conference on Machine Learning, 2016

[72]

The Pile: An 800GB dataset of diverse text for language modeling. arXiv preprint arXiv:2101.00027

[73]

Stability of stochastic gradient descent on nonsmooth convex losses. Advances in Neural Information Processing Systems

[74]

Language Models are Few-Shot Learners

Brown, Tom and Mann, Benjamin and Ryder, Nick and Subbiah, Melanie and Kaplan, Jared D and Dhariwal, Prafulla and Neelakantan, Arvind and Shyam, Pranav and Sastry, Girish and Askell, Amanda and Agarwal, Sandhini and Herbert-Voss, Ariel and Krueger, Gretchen and Henighan, Tom and Child, Rewon and Ramesh, Aditya and Ziegler, Daniel and Wu, Jeffrey and Winte...

[79]

Speak, memory: An archaeology of books known to ChatGPT/GPT-4. arXiv preprint arXiv:2305.00118

[81]

OPT: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068

[83]

A neural probabilistic language model. Advances in Neural Information Processing Systems

[84]

Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling, 2023

[85]

Black, Sid and Biderman, Stella and Hallahan, Eric and Anthony, Quentin and Gao, Leo and Golding, Laurence and He, Horace and Leahy, Connor and McDonell, Kyle and Phang, Jason and Pieler, Michael and Prashanth, USVSN Sai and Purohit, Shivanshu and Reynolds, Laria and Tow, Jonathan and Wang, Ben and Weinbach, Samuel

[87]

Understanding membership inferences on well-generalized learning models. arXiv preprint arXiv:1802.04889

[89]

SILO Language Models: Isolating Legal Risk In a Nonparametric Datastore. arXiv preprint arXiv:2308.04430

[90]

Membership inference attacks from first principles. In 2022 IEEE Symposium on Security and Privacy (SP), 2022

[91]

Jonan Valdez. Los Angeles Times

[92]

Mireshghallah, Fatemehsadat and Goyal, Kartik and Uniyal, Archit and Berg-Kirkpatrick, Taylor and Shokri, Reza. Quantifying Privacy Risks of Masked Language Models Using Membership Inference Attacks. Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing. 2022. doi:10.18653/v1/2022.emnlp-main.570

[94]

Mattern, Justus and Mireshghallah, Fatemehsadat and Jin, Zhijing and Schoelkopf, Bernhard and Sachan, Mrinmaya and Berg-Kirkpatrick, Taylor. Membership Inference Attacks against Language Models via Neighbourhood Comparison. Findings of the Association for Computational Linguistics: ACL 2023. 2023. doi:10.18653/v1/2023.findings-acl.719

[95]

LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971

[96]

Extracting training data from large language models. In 30th USENIX Security Symposium (USENIX Security 21)

[97]

Mark Chen and Jerry Tworek and Heewoo Jun and Qiming Yuan and Henrique Ponde de Oliveira Pinto and Jared Kaplan and Harrison Edwards and Yuri Burda and Nicholas Joseph and Greg Brockman and Alex Ray and Raul Puri and Gretchen Krueger and Michael Petrov and Heidy Khlaaf and Girish Sastry and Pamela Mishkin and Brooke Chan and Scott Gray and Nick Ryder and ...

[99]

Membership inference on word embedding and beyond. arXiv preprint arXiv:2106.11384
