Pith Number

pith:DPM7NWRD

pith:2022:DPM7NWRDQBDKWACAS324KYME6L

not attested · not anchored · not stored · refs resolved

Scaling Laws and Interpretability of Learning from Repeated Data

Ben Mann, Catherine Olsson, Chris Olah, Danny Hernandez, Dario Amodei, Dawn Drain, Jared Kaplan, Nelson Elhage, Nicholas Joseph, Nova DasSarma, Sam McCandlish, Scott Johnston, Sheer El-Showk, Tom Brown, Tom Conerly, Tom Henighan, Tristan Hume, Zac Hatfield-Dodds

Repeating 0.1% of training data 100 times makes an 800M model perform like a 400M model

arxiv:2205.10487 v1 · 2022-05-21 · cs.LG · cs.AI

Record completeness

1 Bitcoin timestamp

2 Internet Archive

3 Author claim open

4 Citations open

5 Replications open

✓ Portable graph bundle live

The bundle contains the canonical record plus signed events. A mirror can host it anywhere and recompute the same current state with the deterministic merge algorithm.

Claims

C1 · strongest claim

Performance of an 800M parameter model can be degraded to that of a 2x smaller model (400M params) by repeating 0.1% of the data 100 times, despite the other 90% of the training tokens remaining unique.
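
The token shares in this claim follow from simple arithmetic. The sketch below assumes, as the quoted figures imply, that the repeated subset is sized at 0.1% of the total training tokens; the variable names are illustrative and do not come from the paper.

```python
from fractions import Fraction

# Assumption: the repeated subset equals 0.1% of the total token budget.
subset_fraction = Fraction(1, 1000)  # 0.1%
repeat_count = 100                   # each token in the subset is seen 100 times

# Share of all training tokens occupied by the repeated subset.
repeated_share = subset_fraction * repeat_count

# Share of training tokens that are seen exactly once.
unique_share = 1 - repeated_share

print(f"repeated share of training tokens: {float(repeated_share):.0%}")
print(f"unique share of training tokens:   {float(unique_share):.0%}")
```

Exact fractions are used so that the 10% / 90% split is not blurred by floating-point rounding.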

C2 · weakest assumption

That the performance degradation is primarily caused by memorization consuming model capacity rather than by changes in optimization dynamics or other unmeasured factors.

C3 · one-line summary

Repeating 0.1% of training data 100 times degrades an 800M parameter model's performance to that of a 400M model and disproportionately damages copying and induction heads, internal structures associated with generalization.

References

71 extracted · 71 resolved · 18 Pith anchors

[1] Learning Transferable Visual Models From Natural Language Supervision 2021 · doi:10.48550/arxiv.2103.00020

[2] Multimodal neurons in artificial neural networks · doi:10.23915/distill.00030

[3] In-context Learning and Induction Heads

[4] Training language models to follow instructions with human feedback 2022 · doi:10.48550/arxiv.2203.02155

[5] A Variational Approach to Learning Curves 2001

Formal links

1 machine-checked theorem link

Cited by

20 papers in Pith

The False Promise of Imitating Proprietary LLMs

Scaling Data-Constrained Language Models

The Falcon Series of Open Language Models

Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling

Receipt and verification

First computed	2026-05-17T23:38:13.661649Z
Builder	pith-number-builder-2026-05-17-v1
Signature	Pith Ed25519 (`pith-v1-2026-05`) · public key
Schema	pith-number/v1.0

Canonical hash

1bd9f6da238046ab004096f5c56184f2ee4f9d899bfef8747904d11cde8645ea

Aliases

arxiv: 2205.10487 · arxiv_version: 2205.10487v1 · doi: 10.48550/arxiv.2205.10487 · pith_short_12: DPM7NWRDQBDK · pith_short_16: DPM7NWRDQBDKWACA · pith_short_8: DPM7NWRD
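
The short aliases above are prefixes of the full 26-character identifier. For this record, the full identifier also matches the first 26 characters of the RFC 4648 base32 encoding of the canonical hash. That second relationship is an observation from the values on this page, not something the page documents, so it should be treated as an assumption for any other record. A minimal sketch:

```python
import base64

# Values copied from this record.
canonical_hash = "1bd9f6da238046ab004096f5c56184f2ee4f9d899bfef8747904d11cde8645ea"
full_id = "DPM7NWRDQBDKWACAS324KYME6L"

# Assumption (inferred, not documented here): the identifier is the
# first 26 characters of the base32-encoded SHA-256 digest.
b32 = base64.b32encode(bytes.fromhex(canonical_hash)).decode("ascii")
derived_id = b32[:26]

# The short aliases are plain prefixes of the full identifier.
aliases = {f"pith_short_{n}": full_id[:n] for n in (8, 12, 16)}

print("derived id matches full id:", derived_id == full_id)
for name, value in aliases.items():
    print(f"{name}: {value}")
```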

Agent API

Resolver JSON · Graph JSON · Events JSON · Schema · Signing key

Verify this Pith Number yourself

curl -sH 'Accept: application/ld+json' https://pith.science/pith/DPM7NWRDQBDKWACAS324KYME6L \
  | jq -c '.canonical_record' \
  | python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: 1bd9f6da238046ab004096f5c56184f2ee4f9d899bfef8747904d11cde8645ea

Canonical record JSON

{
  "metadata": {
    "abstract_canon_sha256": "1f3ba547302854ee4ff49f5540a368b48db97ee6f792bc5d1b6ce32b750eb0bd",
    "cross_cats_sorted": [
      "cs.AI"
    ],
    "license": "http://creativecommons.org/licenses/by/4.0/",
    "primary_cat": "cs.LG",
    "submitted_at": "2022-05-21T02:14:27Z",
    "title_canon_sha256": "5a369711a870bc18ae971249f94ed6b0f5346791131e8e2f0ab4be8f4502fb45"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2205.10487",
    "kind": "arxiv",
    "version": 1
  }
}
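
The same check can be run offline against the record shown above, without fetching anything. The sketch below mirrors the serialization used in the shell one-liner (sorted keys, compact separators, UTF-8, no ASCII escaping); the function name `canonical_sha256` is hypothetical, and whether the resolver applies any further normalization is not stated here.

```python
import hashlib
import json


def canonical_sha256(record: dict) -> str:
    """Hash a record using the serialization from the verification one-liner."""
    payload = json.dumps(
        record,
        sort_keys=True,          # key order must not affect the hash
        separators=(",", ":"),   # no whitespace between tokens
        ensure_ascii=False,      # keep non-ASCII characters as UTF-8
    ).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


# The canonical record, copied from this page.
record = {
    "metadata": {
        "abstract_canon_sha256": "1f3ba547302854ee4ff49f5540a368b48db97ee6f792bc5d1b6ce32b750eb0bd",
        "cross_cats_sorted": ["cs.AI"],
        "license": "http://creativecommons.org/licenses/by/4.0/",
        "primary_cat": "cs.LG",
        "submitted_at": "2022-05-21T02:14:27Z",
        "title_canon_sha256": "5a369711a870bc18ae971249f94ed6b0f5346791131e8e2f0ab4be8f4502fb45",
    },
    "schema_version": "1.0",
    "source": {"id": "2205.10487", "kind": "arxiv", "version": 1},
}

digest = canonical_sha256(record)
print(digest)  # compare by eye with the canonical hash listed on this page
```

Because keys are sorted before hashing, a mirror that stores the record with a different key order should still arrive at the same digest.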