pith:33NV57HY
Scalable Extraction of Training Data from (Production) Language Models
Adversaries can extract gigabytes of training data from language models including ChatGPT by querying them without prior knowledge of the data.
arxiv:2311.17035 v1 · 2023-11-28 · cs.LG · cs.CL · cs.CR
Add to your LaTeX paper
\usepackage{pith}
\pithnumber{33NV57HYMIFM5GWDBTWSEYIN2F}
Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.
Claims
Our methods show practical attacks can recover far more data than previously thought, and reveal that current alignment techniques do not eliminate memorization.
The strings emitted by the models are verifiably present in the original training datasets, rather than being merely plausible generations, and the divergence attack requires no prior knowledge of the training data.
Adversaries can scalably extract gigabytes of training data from open, semi-open, and closed language models via querying attacks, including a divergence method that increases extraction rates 150x on aligned models like ChatGPT.
Receipt and verification
| Field | Value |
|---|---|
| First computed | 2026-05-17T23:38:50.501353Z |
| Builder | pith-number-builder-2026-05-17-v1 |
| Signature | Pith Ed25519 (pith-v1-2026-05) · public key |
| Schema | pith-number/v1.0 |
Canonical hash
dedb5efcf8620ace9ac30ced22610dd1616a0f2592cb05ab0854df3c2d44b3c6
Verify this Pith Number yourself
curl -sH 'Accept: application/ld+json' https://pith.science/pith/33NV57HYMIFM5GWDBTWSEYIN2F \
| jq -c '.canonical_record' \
| python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: dedb5efcf8620ace9ac30ced22610dd1616a0f2592cb05ab0854df3c2d44b3c6
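The same check can be sketched in pure Python, which may be easier to adapt than the shell pipeline. This is a minimal sketch, assuming the canonical form is exactly what the one-liner above produces: JSON with sorted keys, compact separators, and non-ASCII characters left unescaped, encoded as UTF-8. The function name `canonical_sha256` and the variable names are illustrative, not part of any Pith API, and the record literal is copied by hand from this page rather than fetched.

```python
import hashlib
import json


def canonical_sha256(record: dict) -> str:
    """Hash a record using the same serialization as the shell one-liner."""
    # Sorted keys and compact separators make the byte string independent
    # of key order and whitespace in the source JSON.
    canonical = json.dumps(
        record,
        sort_keys=True,
        separators=(",", ":"),
        ensure_ascii=False,
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Canonical record as displayed on this page (copied by hand for illustration).
record = {
    "metadata": {
        "abstract_canon_sha256": "78c268ca3f6d957e3e6106181db95edd8ea82003d47c3b317c03666251909969",
        "cross_cats_sorted": ["cs.CL", "cs.CR"],
        "license": "http://arxiv.org/licenses/nonexclusive-distrib/1.0/",
        "primary_cat": "cs.LG",
        "submitted_at": "2023-11-28T18:47:03Z",
        "title_canon_sha256": "b92f16cf18c2856205cecdb2cb789e5f9b1896bee9511d819789558b5381838d",
    },
    "schema_version": "1.0",
    "source": {"id": "2311.17035", "kind": "arxiv", "version": 1},
}

# Published canonical hash, as listed on this page.
published = "dedb5efcf8620ace9ac30ced22610dd1616a0f2592cb05ab0854df3c2d44b3c6"

digest = canonical_sha256(record)
print("computed: ", digest)
print("published:", published)
print("match:    ", digest == published)
```

Because keys are sorted before hashing, the order in which fields appear in the fetched JSON should not affect the digest; any change to a field value should.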
Canonical record JSON
{
  "metadata": {
    "abstract_canon_sha256": "78c268ca3f6d957e3e6106181db95edd8ea82003d47c3b317c03666251909969",
    "cross_cats_sorted": [
      "cs.CL",
      "cs.CR"
    ],
    "license": "http://arxiv.org/licenses/nonexclusive-distrib/1.0/",
    "primary_cat": "cs.LG",
    "submitted_at": "2023-11-28T18:47:03Z",
    "title_canon_sha256": "b92f16cf18c2856205cecdb2cb789e5f9b1896bee9511d819789558b5381838d"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2311.17035",
    "kind": "arxiv",
    "version": 1
  }
}