pith:A55W7NL4
Scaling Laws for Mixture Pretraining Under Data Constraints
Mixture pretraining tolerates repeating scarce target data 15-20 times, far more than single-source training.
arxiv:2605.12715 v1 · 2026-05-12 · cs.LG · cs.CL
Add to your LaTeX paper
```latex
\usepackage{pith}
\pithnumber{A55W7NL4PBJYG6NCAT2BUUS3DZ}
```
Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.
Claims
Across all settings, we find that repetition is a central driver of target-domain performance, and that mixture training tolerates much higher repetition than single-source training: scarce target corpora can be reused 15-20 times, with the optimal number of repetitions depending on the target data size, compute budget, and model scale.
The repetition-aware scaling law and optimal repetition counts observed in the tested regimes (model sizes, data types, and compute budgets) will continue to hold at larger scales and for data distributions not included in the 2000 runs.
Repetition-aware scaling laws show scarce target data in pretraining mixtures can be repeated 15-20 times optimally, with the best count depending on data size, compute, and model scale.
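To make the "15-20 times" figure concrete, the sketch below computes how many passes a mixture makes over a scarce target corpus. It assumes that a repetition count means target tokens consumed divided by unique target tokens, which is a common convention but is not confirmed by this record. The function name `target_repetitions` and all numbers are hypothetical and are not taken from the paper.

```python
def target_repetitions(total_tokens: float,
                       target_fraction: float,
                       unique_target_tokens: float) -> float:
    """Return the number of passes over the target corpus.

    Assumption (not from the paper): repetitions are defined as
    (tokens drawn from the target source) / (unique target tokens).
    """
    if unique_target_tokens <= 0:
        raise ValueError("unique_target_tokens must be positive")
    if not 0.0 <= target_fraction <= 1.0:
        raise ValueError("target_fraction must lie in [0, 1]")
    target_tokens_seen = total_tokens * target_fraction
    return target_tokens_seen / unique_target_tokens


# Hypothetical budget: 100B training tokens, 10% drawn from a target
# corpus that holds 0.5B unique tokens.
reps = target_repetitions(100e9, 0.10, 0.5e9)
print(f"target corpus repeated about {reps:.1f} times")
```

Under this definition, raising the mixture fraction or the total token budget while the target corpus stays fixed raises the repetition count proportionally.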
Receipt and verification
| First computed | 2026-05-18T03:09:49.483583Z |
|---|---|
| Builder | pith-number-builder-2026-05-17-v1 |
| Signature | Pith Ed25519 (pith-v1-2026-05) · public key |
| Schema | pith-number/v1.0 |
Canonical hash
077b6fb57c78538379a204f41a525b1e72e460235d2ad33e9841448527863f9c
Verify this Pith Number yourself
```shell
curl -sH 'Accept: application/ld+json' https://pith.science/pith/A55W7NL4PBJYG6NCAT2BUUS3DZ \
  | jq -c '.canonical_record' \
  | python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: 077b6fb57c78538379a204f41a525b1e72e460235d2ad33e9841448527863f9c
```
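The same check can be written as a small Python script, which may be easier to read and reuse than the one-liner. This is a minimal sketch that mirrors the serialization settings in the command above (sorted keys, compact separators, `ensure_ascii=False`, UTF-8 encoding). The function names `canonical_sha256` and `verify` are hypothetical, and the script takes an already-parsed record rather than fetching it.

```python
import hashlib
import json


def canonical_sha256(record: dict) -> str:
    """Hash a record using the same canonical form as the one-liner:
    sorted keys, no whitespace between tokens, non-ASCII kept as-is."""
    canonical = json.dumps(
        record,
        sort_keys=True,
        separators=(",", ":"),
        ensure_ascii=False,
    ).encode()  # str.encode() defaults to UTF-8
    return hashlib.sha256(canonical).hexdigest()


def verify(record: dict, expected_hex: str) -> bool:
    """Compare the computed digest with an expected hex string."""
    return canonical_sha256(record) == expected_hex.strip().lower()


# Key order does not affect the digest, because keys are sorted
# before hashing.
a = {"source": {"id": "2605.12715", "kind": "arxiv"}, "schema_version": "1.0"}
b = {"schema_version": "1.0", "source": {"kind": "arxiv", "id": "2605.12715"}}
print(canonical_sha256(a) == canonical_sha256(b))
```

To check this record, load the canonical record JSON with `json.loads` and pass the result to `verify` together with the canonical hash listed above.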
Canonical record JSON
```json
{
  "metadata": {
    "abstract_canon_sha256": "f906965549c0a762f289deb8b942a9f61d2eee483234690cf181624b5bfbe757",
    "cross_cats_sorted": [
      "cs.CL"
    ],
    "license": "http://arxiv.org/licenses/nonexclusive-distrib/1.0/",
    "primary_cat": "cs.LG",
    "submitted_at": "2026-05-12T20:22:45Z",
    "title_canon_sha256": "7e7f8912f86f8380410ff61b54d97267a6f1980b1bfcb924e0619016bd4ebff8"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2605.12715",
    "kind": "arxiv",
    "version": 1
  }
}
```