Pith Number
pith:DPM7NWRD
pith:2022:DPM7NWRDQBDKWACAS324KYME6L
not attested
not anchored
not stored
refs resolved
Scaling Laws and Interpretability of Learning from Repeated Data
Repeating 0.1% of training data 100 times makes an 800M model perform like a 400M model
arxiv:2205.10487 v1 · 2022-05-21 · cs.LG · cs.AI
Record completeness
1. Bitcoin timestamp
2. Internet Archive
3. Author claim
4. Citations
5. Replications
✓ Portable graph bundle: live
The bundle contains the canonical record plus signed events. A mirror can host it anywhere and recompute the same
current state with the deterministic merge algorithm.
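The page does not spell out the merge algorithm, so the following is only a minimal sketch of what a deterministic merge could look like. The event fields (`id`, `ts`, `field`, `value`), the last-writer-wins rule, and the function name `merge_state` are all assumptions made for illustration, not the Pith schema. Signature verification is left out.

```python
from typing import Any


def merge_state(record: dict[str, Any], events: list[dict[str, Any]]) -> dict[str, Any]:
    """Replay events over the canonical record in a fixed order.

    Hypothetical event shape: {"id": str, "ts": ISO-8601 str, "field": str, "value": Any}.
    A real mirror would verify each event's signature before replaying it.
    """
    state: dict[str, Any] = {"record": record, "fields": {}}
    seen: set[str] = set()
    # Sorting by (timestamp, id) removes any dependence on arrival order.
    for event in sorted(events, key=lambda e: (e["ts"], e["id"])):
        if event["id"] in seen:  # the same event may arrive from several mirrors
            continue
        seen.add(event["id"])
        state["fields"][event["field"]] = event["value"]  # last writer wins
    return state


record = {"source": {"id": "2205.10487", "kind": "arxiv", "version": 1}}
events = [
    {"id": "e2", "ts": "2026-05-18T00:00:00Z", "field": "stored", "value": True},
    {"id": "e1", "ts": "2026-05-17T00:00:00Z", "field": "stored", "value": False},
    {"id": "e2", "ts": "2026-05-18T00:00:00Z", "field": "stored", "value": True},
]

# Two mirrors that receive the events in different orders reach the same state.
assert merge_state(record, events) == merge_state(record, list(reversed(events)))
```

The property that matters is that the result depends only on the set of events, not on the order or number of times they were delivered.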
Claims
C1 (strongest claim)
Performance of an 800M parameter model can be degraded to that of a 2x smaller model (400M params) by repeating 0.1% of the data 100 times, despite the other 90% of the training tokens remaining unique.
C2 (weakest assumption)
That the performance degradation is primarily caused by memorization consuming model capacity rather than by changes in optimization dynamics or other unmeasured factors.
C3 (one-line summary)
Repeating 0.1% of training data 100 times degrades an 800M parameter model's performance to that of a 400M model by damaging copying mechanisms and induction heads associated with generalization.
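The figures in C1 fit together arithmetically: a subset equal to 0.1% of the training tokens, seen 100 times, occupies 10% of the token budget, which leaves the 90% that remains unique. A small sketch of that calculation follows. The helper name `token_shares` is illustrative, and reading "0.1%" as a fraction of the total token budget is an assumption, chosen because it reproduces the 90% figure in the claim.

```python
from fractions import Fraction


def token_shares(subset_fraction: Fraction, repeats: int) -> tuple[Fraction, Fraction]:
    """Return (repeated share, unique share) of the training token budget.

    Assumes `subset_fraction` is measured against the total token budget and
    that every token outside the repeated subset is seen exactly once.
    """
    repeated = subset_fraction * repeats
    if not 0 <= repeated <= 1:
        raise ValueError("repeated tokens cannot exceed the training budget")
    return repeated, 1 - repeated


repeated, unique = token_shares(Fraction(1, 1000), 100)
print(f"repeated: {float(repeated):.0%}, unique: {float(unique):.0%}")
# prints "repeated: 10%, unique: 90%"
```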
References
[1] Learning Transferable Visual Models From Natural Language Supervision
[2] Multimodal neurons in artificial neural networks
[3] In-context Learning and Induction Heads
[4] Training language models to follow instructions with human feedback
[5] A Variational Approach to Learning Curves
Receipt and verification
| Field | Value |
|---|---|
| First computed | 2026-05-17T23:38:13.661649Z |
| Builder | pith-number-builder-2026-05-17-v1 |
| Signature | Pith Ed25519 (pith-v1-2026-05) · public key |
| Schema | pith-number/v1.0 |
Canonical hash
1bd9f6da238046ab004096f5c56184f2ee4f9d899bfef8747904d11cde8645ea
Verify this Pith Number yourself
curl -sH 'Accept: application/ld+json' https://pith.science/pith/DPM7NWRDQBDKWACAS324KYME6L \
| jq -c '.canonical_record' \
| python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: 1bd9f6da238046ab004096f5c56184f2ee4f9d899bfef8747904d11cde8645ea
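The same canonicalization can be done offline in Python once the `canonical_record` object is in hand. This sketch mirrors the `json.dumps` arguments of the one-liner above (sorted keys, compact separators, non-ASCII kept as-is, UTF-8 encoding). The function name `canonical_sha256` is illustrative and not part of any Pith API.

```python
import hashlib
import json
from typing import Any


def canonical_sha256(record: Any) -> str:
    """SHA-256 hex digest of the record's canonical JSON serialization."""
    canonical = json.dumps(
        record,
        sort_keys=True,          # key order in the input must not matter
        separators=(",", ":"),   # no whitespace between tokens
        ensure_ascii=False,      # keep non-ASCII characters unescaped
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Two spellings of the same object hash identically after canonicalization.
a = {"schema_version": "1.0", "source": {"id": "2205.10487", "kind": "arxiv", "version": 1}}
b = {"source": {"version": 1, "kind": "arxiv", "id": "2205.10487"}, "schema_version": "1.0"}
assert canonical_sha256(a) == canonical_sha256(b)
```

To check a downloaded record, pass the parsed `canonical_record` to `canonical_sha256` and compare the result with the canonical hash listed on this page.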
Canonical record JSON
{
  "metadata": {
    "abstract_canon_sha256": "1f3ba547302854ee4ff49f5540a368b48db97ee6f792bc5d1b6ce32b750eb0bd",
    "cross_cats_sorted": [
      "cs.AI"
    ],
    "license": "http://creativecommons.org/licenses/by/4.0/",
    "primary_cat": "cs.LG",
    "submitted_at": "2022-05-21T02:14:27Z",
    "title_canon_sha256": "5a369711a870bc18ae971249f94ed6b0f5346791131e8e2f0ab4be8f4502fb45"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2205.10487",
    "kind": "arxiv",
    "version": 1
  }
}