Pith Number

pith:X37K2SZT

pith:2022:X37K2SZTINTS73PEDCULW2JAAQ

not attested · not anchored · not stored · refs resolved

Quantifying Memorization Across Neural Language Models

Chiyuan Zhang, Daphne Ippolito, Florian Tramer, Katherine Lee, Matthew Jagielski, Nicholas Carlini

Memorization in language models increases log-linearly with model size, data duplication, and prompt length.

arxiv:2202.07646 v3 · 2022-02-15 · cs.LG · cs.CL


Add to your LaTeX paper

\usepackage{pith}
\pithnumber{X37K2SZTINTS73PEDCULW2JAAQ}

Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.

Record completeness

1 Bitcoin timestamp

2 Internet Archive

3 Author claim open

4 Citations open

5 Replications open

✓ Portable graph bundle live

The bundle contains the canonical record plus signed events. A mirror can host it anywhere and recompute the same current state with the deterministic merge algorithm.
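The merge algorithm itself is not specified on this page. As a minimal sketch of what "deterministic" could mean here, assuming hypothetical event fields (`ts`, `field`, `value`) and a last-writer-wins rule that are illustrative only and not taken from the Pith schema, the following shows how two mirrors that receive the same events in different orders can still arrive at the same state:

```python
import hashlib
import json


def event_id(event):
    """Content-derived ID: SHA-256 over a canonical JSON serialization."""
    blob = json.dumps(event, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()


def merge(events):
    """Fold events into a state dict in an order that ignores arrival order.

    Assumption (hypothetical): events are ordered by timestamp, ties are
    broken by content hash, exact duplicates are dropped, and the last
    event for a field wins.
    """
    unique = {event_id(e): e for e in events}  # drop exact duplicates
    ordered = sorted(unique.items(), key=lambda kv: (kv[1]["ts"], kv[0]))
    state = {}
    for _, ev in ordered:
        state[ev["field"]] = ev["value"]
    return state


# Hypothetical events, invented for illustration only.
events = [
    {"ts": "2026-05-18T04:38:57Z", "field": "author_claim", "value": "open"},
    {"ts": "2026-05-19T00:00:00Z", "field": "author_claim", "value": "claimed"},
    {"ts": "2026-05-18T05:00:00Z", "field": "replications", "value": "open"},
]

print(merge(events) == merge(list(reversed(events))))
```

Sorting on a key derived only from event content is what makes the result independent of the order in which a mirror happens to receive the events.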

Claims

C1 strongest claim

We describe three log-linear relationships that quantify the degree to which LMs emit memorized training data. Memorization significantly grows as we increase (1) the capacity of a model, (2) the number of times an example has been duplicated, and (3) the number of tokens of context used to prompt the model.
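To make "log-linear" concrete: a quantity y is log-linear in x when y ≈ a + b·log(x), so each tenfold increase in x adds a fixed amount b to y. The sketch below fits that form by ordinary least squares. The data points are synthetic, generated from a known line purely for illustration; they are not measurements from the paper, and the names `fit_log_linear`, `sizes`, and `fractions` are hypothetical.

```python
import math


def fit_log_linear(xs, ys):
    """Least-squares fit of y = a + b * log10(x); returns (a, b)."""
    logs = [math.log10(x) for x in xs]
    n = len(logs)
    mean_l = sum(logs) / n
    mean_y = sum(ys) / n
    sxx = sum((l - mean_l) ** 2 for l in logs)
    sxy = sum((l - mean_l) * (y - mean_y) for l, y in zip(logs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_l
    return a, b


# Synthetic example: model sizes spaced a factor of 10 apart, with a
# made-up "fraction memorized" that rises by 0.02 per factor of 10.
sizes = [1e8, 1e9, 1e10]
fractions = [0.01 + 0.02 * (math.log10(s) - 8) for s in sizes]

a, b = fit_log_linear(sizes, fractions)
print(f"intercept={a:.3f}, slope per 10x={b:.3f}")
```

The same fit applies to the other two variables named in the claim (duplication count and prompt length) by swapping in the corresponding x values.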

C2 weakest assumption

That verbatim emission under the chosen prompting and matching criteria accurately captures the privacy, utility, and fairness harms, and that the log-linear trends will continue to hold at larger scales without additional confounding factors.

C3 one line summary

Memorization in language models increases log-linearly with model capacity, data duplication count, and prompt context length.

References

25 extracted · 25 resolved · 4 Pith anchors

[1] Deep learning with differential privacy 2016

[2] Large-scale differentially private BERT

[3] GPT-Neo: Large Scale Autoregressive Language Modeling with Mesh-Tensorflow, March 2021 · doi:10.5281/zenodo.5297715

[4] Extracting training data from large language models 2021

[5] Evaluating Large Language Models Trained on Code · arXiv:2107.03374

Cited by

39 papers in Pith

GPT-NeoX-20B: An Open-Source Autoregressive Language Model

Data-Centric Foundation Models in Computational Healthcare: A Survey

Harmful Fine-tuning Attacks and Defenses for Large Language Models: A Survey

MLE-bench: Evaluating Machine Learning Agents on Machine Learning Engineering

Towards the Anonymization of the Language Modeling

Receipt and verification

First computed	2026-05-18T04:38:57.963093Z
Builder	pith-number-builder-2026-05-17-v1
Signature	Pith Ed25519 (`pith-v1-2026-05`) · public key
Schema	pith-number/v1.0

Canonical hash

befead4b3343672fede418a8bb69200411c9639ef98b50f57d7ebcd975c9de30

Aliases

arxiv: 2202.07646 · arxiv_version: 2202.07646v3 · doi: 10.48550/arxiv.2202.07646 · pith_short_12: X37K2SZTINTS · pith_short_16: X37K2SZTINTS73PE · pith_short_8: X37K2SZT
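The identifiers on this page are consistent with a simple relationship: the full Pith Number matches the unpadded Base32 encoding of the first 16 bytes of the canonical hash, and each `pith_short_N` alias is its first N characters. This is an observation from this one record, not a documented rule of the scheme, so treat the sketch below as an assumption; the function name `pith_number` is hypothetical.

```python
import base64

# Canonical hash as printed on this page.
CANONICAL_HASH = "befead4b3343672fede418a8bb69200411c9639ef98b50f57d7ebcd975c9de30"


def pith_number(hex_digest, n_bytes=16):
    """Base32-encode the first n_bytes of a hex digest, without '=' padding.

    Assumption: this mirrors how the identifier is derived; it is inferred
    from the values on this page rather than from a specification.
    """
    raw = bytes.fromhex(hex_digest)[:n_bytes]
    return base64.b32encode(raw).decode("ascii").rstrip("=")


full = pith_number(CANONICAL_HASH)
aliases = {f"pith_short_{n}": full[:n] for n in (8, 12, 16)}

print(full)     # X37K2SZTINTS73PEDCULW2JAAQ
print(aliases)
```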

Agent API

Resolver JSON · Graph JSON · Events JSON · Schema · Signing key

Verify this Pith Number yourself

curl -sH 'Accept: application/ld+json' https://pith.science/pith/X37K2SZTINTS73PEDCULW2JAAQ \
  | jq -c '.canonical_record' \
  | python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: befead4b3343672fede418a8bb69200411c9639ef98b50f57d7ebcd975c9de30

Canonical record JSON

{
  "metadata": {
    "abstract_canon_sha256": "b5230ec6f01517894ba4f3fdb0c814e278571fab1daf3260f6f00ccb1142f847",
    "cross_cats_sorted": [
      "cs.CL"
    ],
    "license": "http://arxiv.org/licenses/nonexclusive-distrib/1.0/",
    "primary_cat": "cs.LG",
    "submitted_at": "2022-02-15T18:48:31Z",
    "title_canon_sha256": "b45340bfac6354acfb42d31e8b2975ff2334898c8cb111664e76c5a5e77fe631"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2202.07646",
    "kind": "arxiv",
    "version": 3
  }
}
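The verification one-liner above can also be run offline against the record as printed. The sketch below applies the same canonicalization (sorted keys, compact separators, UTF-8, no ASCII escaping) to a copy of the canonical record from this page. Whether the result equals the published hash has not been checked here, so the script reports the comparison instead of assuming it; `canonical_hash` is a hypothetical helper name.

```python
import hashlib
import json

# Copy of the canonical record shown on this page.
RECORD = {
    "metadata": {
        "abstract_canon_sha256": "b5230ec6f01517894ba4f3fdb0c814e278571fab1daf3260f6f00ccb1142f847",
        "cross_cats_sorted": ["cs.CL"],
        "license": "http://arxiv.org/licenses/nonexclusive-distrib/1.0/",
        "primary_cat": "cs.LG",
        "submitted_at": "2022-02-15T18:48:31Z",
        "title_canon_sha256": "b45340bfac6354acfb42d31e8b2975ff2334898c8cb111664e76c5a5e77fe631",
    },
    "schema_version": "1.0",
    "source": {"id": "2202.07646", "kind": "arxiv", "version": 3},
}

# Hash the page says to expect.
EXPECTED = "befead4b3343672fede418a8bb69200411c9639ef98b50f57d7ebcd975c9de30"


def canonical_hash(record):
    """SHA-256 over JSON with sorted keys and compact separators, as UTF-8."""
    blob = json.dumps(
        record, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    ).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()


digest = canonical_hash(RECORD)
print(digest)
print("matches published hash:", digest == EXPECTED)
```

Because keys are sorted before hashing, the order in which a mirror stores or transmits the fields does not affect the digest.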