Pith Number

pith:X37K2SZT

pith:2022:X37K2SZTINTS73PEDCULW2JAAQ
not attested · not anchored · not stored · refs resolved

Quantifying Memorization Across Neural Language Models

Chiyuan Zhang, Daphne Ippolito, Florian Tramer, Katherine Lee, Matthew Jagielski, Nicholas Carlini

Memorization in language models increases log-linearly with model size, data duplication, and prompt length.

arxiv:2202.07646 v3 · 2022-02-15 · cs.LG · cs.CL

Add to your LaTeX paper
\usepackage{pith}
\pithnumber{X37K2SZTINTS73PEDCULW2JAAQ}

Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.

Record completeness

1 Bitcoin timestamp
2 Internet Archive
3 Author claim open
4 Citations open
5 Replications open
Portable graph bundle live
The bundle contains the canonical record plus signed events. A mirror can host it anywhere and recompute the same current state with the deterministic merge algorithm.

Claims

C1 strongest claim

We describe three log-linear relationships that quantify the degree to which LMs emit memorized training data. Memorization significantly grows as we increase (1) the capacity of a model, (2) the number of times an example has been duplicated, and (3) the number of tokens of context used to prompt the model.

C2 weakest assumption

That verbatim emission under the chosen prompting and matching criteria accurately captures the privacy, utility, and fairness harms, and that the log-linear trends will continue to hold at larger scales without additional confounding factors.

C3 one line summary

Memorization in language models increases log-linearly with model capacity, data duplication count, and prompt context length.
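
To make "log-linear" concrete: the claims above say the memorized fraction grows roughly linearly in the logarithm of a factor (model size, duplication count, or prompt length). The sketch below fits y = a + b * log10(x) by ordinary least squares. The function name `fit_log_linear` and the data points are hypothetical and synthetic; they are not results from the paper and are only meant to show the shape of the relationship.

```python
import math

def fit_log_linear(xs, ys):
    """Least-squares fit of y = a + b * log10(x); returns (a, b)."""
    logs = [math.log10(x) for x in xs]
    n = len(logs)
    mean_l = sum(logs) / n
    mean_y = sum(ys) / n
    # Slope is covariance(log x, y) divided by variance(log x).
    sxx = sum((l - mean_l) ** 2 for l in logs)
    sxy = sum((l - mean_l) * (y - mean_y) for l, y in zip(logs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_l
    return a, b

# Synthetic points generated from y = 0.1 + 0.2 * log10(x).
# These are illustrative values, NOT measurements from the paper.
xs = [1, 10, 100, 1000]
ys = [0.1 + 0.2 * math.log10(x) for x in xs]

a, b = fit_log_linear(xs, ys)
print(f"intercept={a:.3f}, slope per 10x increase={b:.3f}")
```

Under such a fit, the slope `b` reads as the change in the memorized fraction for every tenfold increase in the factor, which is why each additional order of magnitude adds a roughly constant amount.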

References

25 extracted · 25 resolved · 4 Pith anchors

[1] Deep learning with differential privacy 2016
[2] Large-scale differentially private BERT
[3] GPT-Neo: Large Scale Autoregressive Language Modeling with Mesh-Tensorflow, March 2021 · doi:10.5281/zenodo.5297715
[4] Extracting training data from large language models 2012
[5] Evaluating Large Language Models Trained on Code · arXiv:2107.03374

Cited by

39 papers in Pith

Receipt and verification
First computed 2026-05-18T04:38:57.963093Z
Builder pith-number-builder-2026-05-17-v1
Signature Pith Ed25519 (pith-v1-2026-05) · public key
Schema pith-number/v1.0

Canonical hash

befead4b3343672fede418a8bb69200411c9639ef98b50f57d7ebcd975c9de30

Aliases

arxiv: 2202.07646 · arxiv_version: 2202.07646v3 · doi: 10.48550/arxiv.2202.07646 · pith_short_12: X37K2SZTINTS · pith_short_16: X37K2SZTINTS73PE · pith_short_8: X37K2SZT
Agent API
Verify this Pith Number yourself
curl -sH 'Accept: application/ld+json' https://pith.science/pith/X37K2SZTINTS73PEDCULW2JAAQ \
  | jq -c '.canonical_record' \
  | python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: befead4b3343672fede418a8bb69200411c9639ef98b50f57d7ebcd975c9de30
Canonical record JSON
{
  "metadata": {
    "abstract_canon_sha256": "b5230ec6f01517894ba4f3fdb0c814e278571fab1daf3260f6f00ccb1142f847",
    "cross_cats_sorted": [
      "cs.CL"
    ],
    "license": "http://arxiv.org/licenses/nonexclusive-distrib/1.0/",
    "primary_cat": "cs.LG",
    "submitted_at": "2022-02-15T18:48:31Z",
    "title_canon_sha256": "b45340bfac6354acfb42d31e8b2975ff2334898c8cb111664e76c5a5e77fe631"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2202.07646",
    "kind": "arxiv",
    "version": 3
  }
}
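
The same check can be sketched offline in Python, without curl or jq, by hashing the record shown above. This mirrors the serialization used in the one-liner (sorted keys, compact separators, `ensure_ascii=False`, UTF-8 bytes). The helper name `canonical_sha256` is hypothetical, and the sketch assumes the JSON printed on this page is exactly the `canonical_record` that the API returns.

```python
import hashlib
import json

def canonical_sha256(record: dict) -> str:
    """Hash a record using the serialization from the page's one-liner."""
    payload = json.dumps(
        record,
        sort_keys=True,          # key order must not affect the hash
        separators=(",", ":"),   # no whitespace between tokens
        ensure_ascii=False,      # keep non-ASCII characters as-is
    ).encode()                   # str.encode() defaults to UTF-8
    return hashlib.sha256(payload).hexdigest()

# Canonical record copied from this page.
record = {
    "metadata": {
        "abstract_canon_sha256": "b5230ec6f01517894ba4f3fdb0c814e278571fab1daf3260f6f00ccb1142f847",
        "cross_cats_sorted": ["cs.CL"],
        "license": "http://arxiv.org/licenses/nonexclusive-distrib/1.0/",
        "primary_cat": "cs.LG",
        "submitted_at": "2022-02-15T18:48:31Z",
        "title_canon_sha256": "b45340bfac6354acfb42d31e8b2975ff2334898c8cb111664e76c5a5e77fe631",
    },
    "schema_version": "1.0",
    "source": {"id": "2202.07646", "kind": "arxiv", "version": 3},
}

# Hash published on this page under "Canonical hash".
PUBLISHED = "befead4b3343672fede418a8bb69200411c9639ef98b50f57d7ebcd975c9de30"

digest = canonical_sha256(record)
print(digest)
# True only if the JSON above is the exact record that was hashed.
print(digest == PUBLISHED)
```

Sorting keys and fixing the separators is what makes the hash reproducible: two mirrors that hold the same record with different key order or whitespace still serialize to identical bytes.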