{"record_type":"pith_number_record","schema_url":"https://pith.science/schemas/pith-number/v1.json","pith_number":"pith:2022:DPM7NWRDQBDKWACAS324KYME6L","short_pith_number":"pith:DPM7NWRD","schema_version":"1.0","canonical_sha256":"1bd9f6da238046ab004096f5c56184f2ee4f9d899bfef8747904d11cde8645ea","source":{"kind":"arxiv","id":"2205.10487","version":1},"attestation_state":"computed","paper":{"title":"Scaling Laws and Interpretability of Learning from Repeated Data","license":"http://creativecommons.org/licenses/by/4.0/","headline":"Repeating 0.1% of training data 100 times makes an 800M model perform like a 400M model","cross_cats":["cs.AI"],"primary_cat":"cs.LG","authors_text":"Ben Mann, Catherine Olsson, Chris Olah, Danny Hernandez, Dario Amodei, Dawn Drain, Jared Kaplan, Nelson Elhage, Nicholas Joseph, Nova DasSarma, Sam McCandlish, Scott Johnston, Sheer El-Showk, Tom Brown, Tom Conerly, Tom Henighan, Tristan Hume, Zac Hatfield-Dodds","submitted_at":"2022-05-21T02:14:27Z","abstract_excerpt":"Recent large language models have been trained on vast datasets, but also often on repeated data, either intentionally for the purpose of upweighting higher quality data, or unintentionally because data deduplication is not perfect and the model is exposed to repeated data at the sentence, paragraph, or document level. Some works have reported substantial negative performance effects of this repeated data. In this paper we attempt to study repeated data systematically and to understand its effects mechanistically. To do this, we train a family of models where most of the data is unique but a s"},"verification_status":{"content_addressed":true,"pith_receipt":true,"author_attested":false,"weak_author_claims":0,"strong_author_claims":0,"externally_anchored":false,"storage_verified":false,"citation_signatures":0,"replication_records":0,"graph_snapshot":true,"references_resolved":true,"formal_links_present":true},"canonical_record":{"source":{"id":"2205.10487","kind":"arxiv","version":1},"metadata":{"license":"http://creativecommons.org/licenses/by/4.0/","primary_cat":"cs.LG","submitted_at":"2022-05-21T02:14:27Z","cross_cats_sorted":["cs.AI"],"title_canon_sha256":"5a369711a870bc18ae971249f94ed6b0f5346791131e8e2f0ab4be8f4502fb45","abstract_canon_sha256":"1f3ba547302854ee4ff49f5540a368b48db97ee6f792bc5d1b6ce32b750eb0bd"},"schema_version":"1.0"},"receipt":{"kind":"pith_receipt","key_id":"pith-v1-2026-05","algorithm":"ed25519","signed_at":"2026-05-17T23:38:13.662197Z","signature_b64":"cBUSLCLsWy8w4gwDbafz6c0TfM6eRN+pUZwlkULco+txld4Y2eAKud5ydxzFvO+ivZr+eAn0WSEshl4w4ijCAQ==","signed_message":"canonical_sha256_bytes","builder_version":"pith-number-builder-2026-05-17-v1","receipt_version":"0.3","canonical_sha256":"1bd9f6da238046ab004096f5c56184f2ee4f9d899bfef8747904d11cde8645ea","last_reissued_at":"2026-05-17T23:38:13.661649Z","signature_status":"signed_v1","first_computed_at":"2026-05-17T23:38:13.661649Z","public_key_fingerprint":"8d4b5ee74e4693bcd1df2446408b0d54"},"graph_snapshot":{"paper":{"title":"Scaling Laws and Interpretability of Learning from Repeated Data","license":"http://creativecommons.org/licenses/by/4.0/","headline":"Repeating 0.1% of training data 100 times makes an 800M model perform like a 400M model","cross_cats":["cs.AI"],"primary_cat":"cs.LG","authors_text":"Ben Mann, Catherine Olsson, Chris Olah, Danny Hernandez, Dario Amodei, Dawn Drain, Jared Kaplan, Nelson Elhage, Nicholas Joseph, Nova DasSarma, Sam McCandlish, Scott Johnston, Sheer El-Showk, Tom Brown, Tom Conerly, Tom Henighan, Tristan Hume, Zac Hatfield-Dodds","submitted_at":"2022-05-21T02:14:27Z","abstract_excerpt":"Recent large language models have been trained on vast datasets, but also often on repeated data, either intentionally for the purpose of upweighting higher quality data, or unintentionally because data deduplication is not perfect and the model is exposed to repeated data at the sentence, paragraph, or document level. Some works have reported substantial negative performance effects of this repeated data. In this paper we attempt to study repeated data systematically and to understand its effects mechanistically. To do this, we train a family of models where most of the data is unique but a s"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"Performance of an 800M parameter model can be degraded to that of a 2x smaller model (400M params) by repeating 0.1% of the data 100 times, despite the other 90% of the training tokens remaining unique.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"That the performance degradation is primarily caused by memorization consuming model capacity rather than by changes in optimization dynamics or other unmeasured factors.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"Repeating 0.1% of training data 100 times degrades an 800M parameter model's performance to that of a 400M model by damaging copying mechanisms and induction heads associated with generalization.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"Repeating 0.1% of training data 100 times makes an 800M model perform like a 400M model","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"bbd9633165d07cc554f987a4d1349e7934118a7f79c592aff31d90bf4e0d4fb9"},"source":{"id":"2205.10487","kind":"arxiv","version":1},"verdict":{"id":"08b4d2b7-b812-4d6b-8311-8af17ff860f0","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-17T15:44:32.218364Z","strongest_claim":"Performance of an 800M parameter model can be degraded to that of a 2x smaller model (400M params) by repeating 0.1% of the data 100 times, despite the other 90% of the training tokens remaining unique.","one_line_summary":"Repeating 0.1% of training data 100 times degrades an 800M parameter model's performance to that of a 400M model by damaging copying mechanisms and induction heads associated with generalization.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"That the performance degradation is primarily caused by memorization consuming model capacity rather than by changes in optimization dynamics or other unmeasured factors.","pith_extraction_headline":"Repeating 0.1% of training data 100 times makes an 800M model perform like a 400M model"},"references":{"count":71,"sample":[{"doi":"10.48550/arxiv.2103.00020","year":2021,"title":"Learning Transferable Visual Models From Natural Language Supervision","work_id":"6de86bb5-27bd-4d5c-8b89-967ebfc52659","ref_index":1,"cited_arxiv_id":"2103.00020","is_internal_anchor":true},{"doi":"10.23915/distill.00030","year":null,"title":"Multimodal neurons in artificial neural networks","work_id":"a5431036-9258-4452-954d-965edf6456ef","ref_index":2,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":null,"title":"In-context Learning and Induction Heads , year =","work_id":"e25d4ab0-6097-4d74-841c-db89def7a69b","ref_index":3,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"10.48550/arxiv.2203.02155","year":2022,"title":"Training language models to follow instructions with human feedback","work_id":"52aff42f-4fa9-4fcf-bdb3-1459b9bebf65","ref_index":4,"cited_arxiv_id":"2203.02155","is_internal_anchor":true},{"doi":"","year":2001,"title":"A Variational Approach to Learning Curves , url =","work_id":"678d0b26-f77f-4a51-afe7-457123410a55","ref_index":5,"cited_arxiv_id":"","is_internal_anchor":false}],"resolved_work":71,"snapshot_sha256":"2b7a18c0f29b5483bd7c0752ff88f6abc042c6ddd43edd3a9b96012bcb387920","internal_anchors":18},"formal_canon":{"evidence_count":1,"snapshot_sha256":"2779946cac165857b6d4bb9b1ed990de343ecaba8be29d223f40f9c0bfc49eb1"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"},"aliases":[{"alias_kind":"arxiv","alias_value":"2205.10487","created_at":"2026-05-17T23:38:13.661731+00:00"},{"alias_kind":"arxiv_version","alias_value":"2205.10487v1","created_at":"2026-05-17T23:38:13.661731+00:00"},{"alias_kind":"doi","alias_value":"10.48550/arxiv.2205.10487","created_at":"2026-05-17T23:38:13.661731+00:00"},{"alias_kind":"pith_short_12","alias_value":"DPM7NWRDQBDK","created_at":"2026-05-18T12:33:33.725879+00:00"},{"alias_kind":"pith_short_16","alias_value":"DPM7NWRDQBDKWACA","created_at":"2026-05-18T12:33:33.725879+00:00"},{"alias_kind":"pith_short_8","alias_value":"DPM7NWRD","created_at":"2026-05-18T12:33:33.725879+00:00"}],"events":[],"event_summary":{},"paper_claims":[],"inbound_citations":{"count":20,"internal_anchor_count":20,"sample":[{"citing_arxiv_id":"2509.18218","citing_title":"Similarity Field Theory: A Mathematical Framework for Intelligence","ref_index":28,"is_internal_anchor":true},{"citing_arxiv_id":"2305.15717","citing_title":"The False Promise of Imitating Proprietary LLMs","ref_index":57,"is_internal_anchor":true},{"citing_arxiv_id":"2305.16264","citing_title":"Scaling Data-Constrained Language Models","ref_index":43,"is_internal_anchor":true},{"citing_arxiv_id":"2311.16867","citing_title":"The Falcon Series of Open Language Models","ref_index":290,"is_internal_anchor":true},{"citing_arxiv_id":"2304.01373","citing_title":"Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling","ref_index":104,"is_internal_anchor":true},{"citing_arxiv_id":"2605.12715","citing_title":"Scaling Laws for Mixture Pretraining Under Data Constraints","ref_index":31,"is_internal_anchor":true},{"citing_arxiv_id":"2605.13225","citing_title":"Mix, Don't Tune: Bilingual Pre-Training Outperforms Hyperparameter Search in Data-Constrained Settings","ref_index":7,"is_internal_anchor":true},{"citing_arxiv_id":"2306.01116","citing_title":"The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only","ref_index":26,"is_internal_anchor":true},{"citing_arxiv_id":"2505.13211","citing_title":"MAGI-1: Autoregressive Video Generation at Scale","ref_index":15,"is_internal_anchor":true},{"citing_arxiv_id":"2211.09085","citing_title":"Galactica: A Large Language Model for Science","ref_index":174,"is_internal_anchor":true},{"citing_arxiv_id":"2211.09085","citing_title":"Galactica: A Large Language Model for Science","ref_index":92,"is_internal_anchor":true},{"citing_arxiv_id":"2605.09189","citing_title":"Practical Scaling Laws: Converting Compute into Performance in a Data-Constrained World","ref_index":24,"is_internal_anchor":true},{"citing_arxiv_id":"2605.01640","citing_title":"Prescriptive Scaling Laws for Data Constrained Training","ref_index":6,"is_internal_anchor":true},{"citing_arxiv_id":"2605.00817","citing_title":"When LLMs Stop Following Steps: A Diagnostic Study of Procedural Execution in Language Models","ref_index":11,"is_internal_anchor":true},{"citing_arxiv_id":"2402.06196","citing_title":"Large Language Models: A Survey","ref_index":125,"is_internal_anchor":true},{"citing_arxiv_id":"2604.09389","citing_title":"Is More Data Worth the Cost? Dataset Scaling Laws in a Tiny Attention-Only Decoder","ref_index":3,"is_internal_anchor":true},{"citing_arxiv_id":"2305.06161","citing_title":"StarCoder: may the source be with you!","ref_index":159,"is_internal_anchor":true},{"citing_arxiv_id":"2303.18223","citing_title":"A Survey of Large Language Models","ref_index":238,"is_internal_anchor":true},{"citing_arxiv_id":"2605.05227","citing_title":"Rethinking Data Curation in LLM Training: Online Reweighting Offers Better Generalization than Offline Methods","ref_index":13,"is_internal_anchor":true},{"citing_arxiv_id":"2505.10465","citing_title":"Superposition Yields Robust Neural Scaling","ref_index":22,"is_internal_anchor":true}]},"formal_canon":{"evidence_count":1,"sample":[],"anchors":[]},"links":{"html":"https://pith.science/pith/DPM7NWRDQBDKWACAS324KYME6L","json":"https://pith.science/pith/DPM7NWRDQBDKWACAS324KYME6L.json","graph_json":"https://pith.science/api/pith-number/DPM7NWRDQBDKWACAS324KYME6L/graph.json","events_json":"https://pith.science/api/pith-number/DPM7NWRDQBDKWACAS324KYME6L/events.json","paper":"https://pith.science/paper/DPM7NWRD"},"agent_actions":{"view_html":"https://pith.science/pith/DPM7NWRDQBDKWACAS324KYME6L","download_json":"https://pith.science/pith/DPM7NWRDQBDKWACAS324KYME6L.json","view_paper":"https://pith.science/paper/DPM7NWRD","resolve_alias":"https://pith.science/api/pith-number/resolve?arxiv=2205.10487&json=true","fetch_graph":"https://pith.science/api/pith-number/DPM7NWRDQBDKWACAS324KYME6L/graph.json","fetch_events":"https://pith.science/api/pith-number/DPM7NWRDQBDKWACAS324KYME6L/events.json","actions":{"anchor_timestamp":"https://pith.science/pith/DPM7NWRDQBDKWACAS324KYME6L/action/timestamp_anchor","attest_storage":"https://pith.science/pith/DPM7NWRDQBDKWACAS324KYME6L/action/storage_attestation","attest_author":"https://pith.science/pith/DPM7NWRDQBDKWACAS324KYME6L/action/author_attestation","sign_citation":"https://pith.science/pith/DPM7NWRDQBDKWACAS324KYME6L/action/citation_signature","submit_replication":"https://pith.science/pith/DPM7NWRDQBDKWACAS324KYME6L/action/replication_record"}},"created_at":"2026-05-17T23:38:13.661731+00:00","updated_at":"2026-05-17T23:38:13.661731+00:00"}