Deduplicating Training Data Mitigates Privacy Risks in Language Models

Colin Raffel; Eric Wallace; Nikhil Kandpal

arxiv: 2202.06539 · v3 · pith:BJ7NM3YRnew · submitted 2022-02-14 · cs.CR · cs.CL · cs.LG

classification cs.CR · cs.CL · cs.LG

keywords training · sequences · attacks · language · models · privacy · data · sequence

Past work has shown that large language models are susceptible to privacy attacks, where adversaries generate sequences from a trained model and detect which sequences are memorized from the training set. In this work, we show that the success of these attacks is largely due to duplication in commonly used web-scraped training sets. We first show that the rate at which language models regenerate training sequences is superlinearly related to a sequence's count in the training set. For instance, a sequence that is present 10 times in the training data is on average generated ~1000 times more often than a sequence that is present only once. We next show that existing methods for detecting memorized sequences have near-chance accuracy on non-duplicated training sequences. Finally, we find that after applying methods to deduplicate training data, language models are considerably more secure against these types of privacy attacks. Taken together, our results motivate an increased focus on deduplication in privacy-sensitive applications and a reevaluation of the practicality of existing privacy attacks.
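
The two quantitative ideas in the abstract, duplicate counts and a superlinear generation rate, can be illustrated with a small sketch. This is not the authors' pipeline: it assumes exact-match deduplication over whole sequences (the abstract does not say which deduplication method was used), and it assumes, purely for illustration, that the generation rate follows a power law in the duplicate count. The function names `count_duplicates`, `deduplicate`, and `implied_exponent`, and the toy corpus, are hypothetical.

```python
import math
from collections import Counter


def count_duplicates(sequences):
    """Map each distinct sequence to the number of times it occurs."""
    return Counter(sequences)


def deduplicate(sequences):
    """Keep the first occurrence of each sequence, preserving order.

    Assumption: exact string match only; a real pipeline may also
    remove near-duplicates or duplicated substrings.
    """
    return list(dict.fromkeys(sequences))


def implied_exponent(count_ratio, rate_ratio):
    """Return k such that rate_ratio == count_ratio ** k.

    Assumption: a power-law relationship, which the abstract does not state.
    """
    return math.log(rate_ratio) / math.log(count_ratio)


# Toy corpus (made up for illustration).
corpus = ["the cat sat", "call 555-0100", "the cat sat", "the cat sat"]

counts = count_duplicates(corpus)   # duplicate count per sequence
unique = deduplicate(corpus)        # one copy of each sequence

# The abstract's example: 10 copies, roughly 1000 times more generations.
k = implied_exponent(10, 1000)
```

Under the power-law assumption, the abstract's example corresponds to an exponent of about 3, which is one way to read "superlinearly related". After `deduplicate`, every sequence has a count of one, the regime in which the abstract reports near-chance detection accuracy.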

Forward citations

Cited by 10 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling
cs.CL 2023-04 accept novelty 8.0
Pythia releases 16 identically trained LLMs with full checkpoints and data tools to study training dynamics, scaling, memorization, and bias in language models.

Clinically Grounded Privacy Evaluation of Medical LMs
cs.CL 2026-06 unverdicted novelty 7.0
Presents a clinically grounded privacy evaluation framework for medical LMs that measures verbatim memorization and semantic leakage of diagnoses across tiers of adversarial access, finding high leakage from routine m...

When Tables Leak: Attacking String Memorization in LLM-Based Tabular Data Generation
cs.LG 2025-12 conditional novelty 7.0
LLM tabular generators leak memorized numeric strings, allowing a no-box attack to achieve near-perfect membership inference on some state-of-the-art models.

InCoder: A Generative Model for Code Infilling and Synthesis
cs.SE 2022-04 unverdicted novelty 7.0
InCoder is the first generative model to directly perform zero-shot code infilling via bidirectional context from a masked-then-appended training scheme, matching left-to-right models on synthesis while improving on t...

Quantifying Memorization Across Neural Language Models
cs.LG 2022-02 unverdicted novelty 7.0
Memorization in language models increases log-linearly with model capacity, data duplication count, and prompt context length.

SemDeDup: Data-efficient learning at web-scale through semantic deduplication
cs.LG 2023-03 unverdicted novelty 6.0
SemDeDup removes semantic duplicates from datasets like LAION using pre-trained embeddings, cutting data by 50% with minimal performance loss and efficiency gains on C4.

Emergent Abilities of Large Language Models
cs.CL 2022-06 unverdicted novelty 6.0
Emergent abilities are capabilities present in large language models but absent in smaller ones and cannot be predicted by extrapolating smaller model performance.

GPT-NeoX-20B: An Open-Source Autoregressive Language Model
cs.CL 2022-04 accept novelty 6.0
GPT-NeoX-20B is a publicly released 20B parameter autoregressive language model trained on the Pile that shows strong gains in five-shot reasoning over similarly sized prior models.

PaLM: Scaling Language Modeling with Pathways
cs.CL 2022-04 accept novelty 6.0
PaLM 540B demonstrates continued scaling benefits by setting new few-shot SOTA results on hundreds of benchmarks and outperforming humans on BIG-bench.

StarCoder: may the source be with you!
cs.CL 2023-05 accept novelty 5.0
StarCoderBase matches or beats OpenAI's code-cushman-001 on multi-language code benchmarks; the Python-fine-tuned StarCoder reaches 40% pass@1 on HumanEval while retaining other-language performance.