{"state_type":"pith_open_graph_state","state_version":"1.0","pith_number":"pith:2026:A55W7NL4PBJYG6NCAT2BUUS3DZ","merge_version":"pith-open-graph-merge-v1","event_count":2,"valid_event_count":2,"invalid_event_count":0,"equivocation_count":0,"current":{"canonical_record":{"metadata":{"abstract_canon_sha256":"f906965549c0a762f289deb8b942a9f61d2eee483234690cf181624b5bfbe757","cross_cats_sorted":["cs.CL"],"license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","primary_cat":"cs.LG","submitted_at":"2026-05-12T20:22:45Z","title_canon_sha256":"7e7f8912f86f8380410ff61b54d97267a6f1980b1bfcb924e0619016bd4ebff8"},"schema_version":"1.0","source":{"id":"2605.12715","kind":"arxiv","version":1}},"source_aliases":[{"alias_kind":"arxiv","alias_value":"2605.12715","created_at":"2026-05-18T03:09:49Z"},{"alias_kind":"arxiv_version","alias_value":"2605.12715v1","created_at":"2026-05-18T03:09:49Z"},{"alias_kind":"doi","alias_value":"10.48550/arxiv.2605.12715","created_at":"2026-05-18T03:09:49Z"},{"alias_kind":"pith_short_12","alias_value":"A55W7NL4PBJY","created_at":"2026-05-18T12:33:37Z"},{"alias_kind":"pith_short_16","alias_value":"A55W7NL4PBJYG6NC","created_at":"2026-05-18T12:33:37Z"},{"alias_kind":"pith_short_8","alias_value":"A55W7NL4","created_at":"2026-05-18T12:33:37Z"}],"graph_snapshots":[{"event_id":"sha256:f80db361f79c1084e6d4fd97f3ff519f1f4fb1c793961b68067d35b961f57066","target":"graph","created_at":"2026-05-18T03:09:49Z","signer":{"key_id":"pith-v1-2026-05","public_key_fingerprint":"8d4b5ee74e4693bcd1df2446408b0d54","signer_id":"pith.science","signer_type":"pith_registry"},"payload":{"graph_snapshot":{"author_claims":{"count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57","strong_count":0},"builder_version":"pith-number-builder-2026-05-17-v1","claims":{"count":4,"items":[{"attestation":"unclaimed","claim_id":"C1","kind":"strongest_claim","source":"verdict.strongest_claim","status":"machine_extracted","text":"Across all settings, we find that repetition is a central driver of target-domain performance, and that mixture training tolerates much higher repetition than single-source training: scarce target corpora can be reused 15-20 times, with the optimal number of repetitions depending on the target data size, compute budget, and model scale."},{"attestation":"unclaimed","claim_id":"C2","kind":"weakest_assumption","source":"verdict.weakest_assumption","status":"machine_extracted","text":"The repetition-aware scaling law and optimal repetition counts observed in the tested regimes (model sizes, data types, and compute budgets) will continue to hold at larger scales and for data distributions not included in the 2000 runs."},{"attestation":"unclaimed","claim_id":"C3","kind":"one_line_summary","source":"verdict.one_line_summary","status":"machine_extracted","text":"Repetition-aware scaling laws show scarce target data in pretraining mixtures can be repeated 15-20 times optimally, with the best count depending on data size, compute, and model scale."},{"attestation":"unclaimed","claim_id":"C4","kind":"headline","source":"verdict.pith_extraction.headline","status":"machine_extracted","text":"Mixture pretraining tolerates repeating scarce target data 15-20 times, far more than single-source training."}],"snapshot_sha256":"7732c3a22e0b1c7285865e50126d3e0b61b48fe6c51937c6f0c014264baf441f"},"formal_canon":{"evidence_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"paper":{"abstract_excerpt":"As language models scale, the amount of data they require grows -- yet many target data sources, such as low-resource languages or specialized domains, are inherently limited in size. A common strategy is to mix this scarce but valuable target data with abundant generic data, which presents a fundamental trade-off: too little target data in the mixture underexposes the model to the target domain, while too much target data repeats the same examples excessively, yielding diminishing returns and eventual overfitting. We study this trade-off across more than 2,000 language-model training runs spa","authors_text":"Anastasiia Sedova, Natalie Schluter, Pierre Ablin, Skyler Seto","cross_cats":["cs.CL"],"headline":"Mixture pretraining tolerates repeating scarce target data 15-20 times, far more than single-source training.","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","primary_cat":"cs.LG","submitted_at":"2026-05-12T20:22:45Z","title":"Scaling Laws for Mixture Pretraining Under Data Constraints"},"references":{"count":58,"internal_anchors":10,"resolved_work":58,"sample":[{"cited_arxiv_id":"","doi":"","is_internal_anchor":false,"ref_index":1,"title":"Scaling Laws for Optimal Data Mixtures , author=. 2025 , eprint=","work_id":"c6cf5e6e-6c65-40db-a6c7-05e25b332329","year":2025},{"cited_arxiv_id":"","doi":"","is_internal_anchor":false,"ref_index":2,"title":"Tensor Programs","work_id":"cd82f433-3741-4bca-8e29-7727676acc09","year":null},{"cited_arxiv_id":"2001.08361","doi":"","is_internal_anchor":true,"ref_index":3,"title":"Scaling Laws for Neural Language Models","work_id":"b7dd8749-9c45-4977-ab9b-64478dce1ae8","year":2001},{"cited_arxiv_id":"","doi":"","is_internal_anchor":false,"ref_index":4,"title":"arXiv preprint arXiv:2402.07871 , year=","work_id":"e67733fa-7550-4e7a-b1e0-d65341a18264","year":null},{"cited_arxiv_id":"","doi":"","is_internal_anchor":false,"ref_index":5,"title":"Scaling Laws Across Model Architectures: A Comparative Analysis of Dense and MoE Models in Large Language Models , author=. EMNLP , year=","work_id":"61b3db82-81b1-4eef-8e3c-e8a7b63adfc1","year":null}],"snapshot_sha256":"941f284e55429df4cf999defa1c583d929b9436e2a40bf2be64d5b6a3464567e"},"source":{"id":"2605.12715","kind":"arxiv","version":1},"verdict":{"created_at":"2026-05-14T21:41:19.661355Z","id":"f691d398-c9e7-4676-a27d-db1167b3c133","model_set":{"reader":"grok-4.3"},"one_line_summary":"Repetition-aware scaling laws show scarce target data in pretraining mixtures can be repeated 15-20 times optimally, with the best count depending on data size, compute, and model scale.","pipeline_version":"pith-pipeline@v0.9.0","pith_extraction_headline":"Mixture pretraining tolerates repeating scarce target data 15-20 times, far more than single-source training.","strongest_claim":"Across all settings, we find that repetition is a central driver of target-domain performance, and that mixture training tolerates much higher repetition than single-source training: scarce target corpora can be reused 15-20 times, with the optimal number of repetitions depending on the target data size, compute budget, and model scale.","weakest_assumption":"The repetition-aware scaling law and optimal repetition counts observed in the tested regimes (model sizes, data types, and compute budgets) will continue to hold at larger scales and for data distributions not included in the 2000 runs."}},"verdict_id":"f691d398-c9e7-4676-a27d-db1167b3c133"}}],"author_attestations":[],"timestamp_anchors":[],"storage_attestations":[],"citation_signatures":[],"replication_records":[],"corrections":[],"mirror_hints":[],"record_created":{"event_id":"sha256:59a77ea756fa839c9141a73f9938b180fbaaa2fe07c1737fc24f7850ed6462c1","target":"record","created_at":"2026-05-18T03:09:49Z","signer":{"key_id":"pith-v1-2026-05","public_key_fingerprint":"8d4b5ee74e4693bcd1df2446408b0d54","signer_id":"pith.science","signer_type":"pith_registry"},"payload":{"attestation_state":"computed","canonical_record":{"metadata":{"abstract_canon_sha256":"f906965549c0a762f289deb8b942a9f61d2eee483234690cf181624b5bfbe757","cross_cats_sorted":["cs.CL"],"license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","primary_cat":"cs.LG","submitted_at":"2026-05-12T20:22:45Z","title_canon_sha256":"7e7f8912f86f8380410ff61b54d97267a6f1980b1bfcb924e0619016bd4ebff8"},"schema_version":"1.0","source":{"id":"2605.12715","kind":"arxiv","version":1}},"canonical_sha256":"077b6fb57c78538379a204f41a525b1e72e460235d2ad33e9841448527863f9c","receipt":{"algorithm":"ed25519","builder_version":"pith-number-builder-2026-05-17-v1","canonical_sha256":"077b6fb57c78538379a204f41a525b1e72e460235d2ad33e9841448527863f9c","first_computed_at":"2026-05-18T03:09:49.483583Z","key_id":"pith-v1-2026-05","kind":"pith_receipt","last_reissued_at":"2026-05-18T03:09:49.483583Z","public_key_fingerprint":"8d4b5ee74e4693bcd1df2446408b0d54","receipt_version":"0.3","signature_b64":"G8SVQp94MTae7SgEDrhuri6etsDAr8j2rrUSXCQEAaKFhNqibSbh5A0J44YsScP4JeQIs9pfnoKqQrXXybKNBA==","signature_status":"signed_v1","signed_at":"2026-05-18T03:09:49.484467Z","signed_message":"canonical_sha256_bytes"},"source_id":"2605.12715","source_kind":"arxiv","source_version":1}}},"equivocations":[],"invalid_events":[],"applied_event_ids":["sha256:59a77ea756fa839c9141a73f9938b180fbaaa2fe07c1737fc24f7850ed6462c1","sha256:f80db361f79c1084e6d4fd97f3ff519f1f4fb1c793961b68067d35b961f57066"],"state_sha256":"9832f03cf1b99402c3a56834fe12ca20646d27826ebd027b7bf6dc31b55feb68"}