Pith Number

pith:A55W7NL4

pith:2026:A55W7NL4PBJYG6NCAT2BUUS3DZ

not attested · not anchored · not stored · refs resolved

Scaling Laws for Mixture Pretraining Under Data Constraints

Anastasiia Sedova, Natalie Schluter, Pierre Ablin, Skyler Seto

Mixture pretraining tolerates repeating scarce target data 15-20 times, far more than single-source training.

arxiv:2605.12715 v1 · 2026-05-12 · cs.LG · cs.CL

Add to your LaTeX paper

\usepackage{pith}
\pithnumber{A55W7NL4PBJYG6NCAT2BUUS3DZ}

Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.

Record completeness

1 Bitcoin timestamp

2 Internet Archive

3 Author claim open

4 Citations open

5 Replications open

✓ Portable graph bundle live

The bundle contains the canonical record plus signed events. A mirror can host it anywhere and recompute the same current state with the deterministic merge algorithm.

Claims

C1 · strongest claim

Across all settings, we find that repetition is a central driver of target-domain performance, and that mixture training tolerates much higher repetition than single-source training: scarce target corpora can be reused 15-20 times, with the optimal number of repetitions depending on the target data size, compute budget, and model scale.

C2 · weakest assumption

The repetition-aware scaling law and optimal repetition counts observed in the tested regimes (model sizes, data types, and compute budgets) will continue to hold at larger scales and for data distributions not included in the 2000 runs.

C3 · one-line summary

Repetition-aware scaling laws show that scarce target data in pretraining mixtures can be repeated 15-20 times, with the optimal repetition count depending on target data size, compute budget, and model scale.

References

58 extracted · 58 resolved · 10 Pith anchors

[1] Scaling Laws for Optimal Data Mixtures. 2025.

[2] Tensor Programs

[3] Scaling Laws for Neural Language Models. 2020. arXiv:2001.08361

[4] arXiv:2402.07871

[5] Scaling Laws Across Model Architectures: A Comparative Analysis of Dense and MoE Models in Large Language Models. EMNLP.

Receipt and verification

First computed	2026-05-18T03:09:49.483583Z
Builder	pith-number-builder-2026-05-17-v1
Signature	Pith Ed25519 (`pith-v1-2026-05`) · public key
Schema	pith-number/v1.0

Canonical hash

077b6fb57c78538379a204f41a525b1e72e460235d2ad33e9841448527863f9c

Aliases

arxiv: 2605.12715 · arxiv_version: 2605.12715v1 · doi: 10.48550/arxiv.2605.12715 · pith_short_12: A55W7NL4PBJY · pith_short_16: A55W7NL4PBJYG6NC · pith_short_8: A55W7NL4
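The identifiers on this page appear to be related in a simple way. Base32-encoding the canonical hash above (RFC 4648 alphabet) and keeping the first 26 characters reproduces the full Pith Number, and the pith_short_* aliases are its 8-, 12-, and 16-character prefixes. The record does not document this derivation, so treat the sketch below as an observation that holds for the values shown here. The function names are hypothetical.

```python
import base64

# Canonical hash as printed on this page.
CANONICAL_HASH = "077b6fb57c78538379a204f41a525b1e72e460235d2ad33e9841448527863f9c"


def pith_id_from_hash(hex_digest: str, length: int = 26) -> str:
    """Base32-encode the digest bytes and keep the first `length` characters.

    Assumption: the 26-character length is inferred from this record,
    not taken from a published specification.
    """
    encoded = base64.b32encode(bytes.fromhex(hex_digest)).decode("ascii")
    return encoded[:length]


def short_aliases(full_id: str) -> dict[str, str]:
    """Build the short aliases as plain prefixes of the full identifier."""
    return {f"pith_short_{n}": full_id[:n] for n in (8, 12, 16)}


full_id = pith_id_from_hash(CANONICAL_HASH)
print(full_id)  # A55W7NL4PBJYG6NCAT2BUUS3DZ
print(short_aliases(full_id))
```

Because a short alias is only a prefix, it carries fewer bits than the full identifier, so the full form is the safer choice wherever collisions matter.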

Verify this Pith Number yourself

curl -sH 'Accept: application/ld+json' https://pith.science/pith/A55W7NL4PBJYG6NCAT2BUUS3DZ \
  | jq -c '.canonical_record' \
  | python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: 077b6fb57c78538379a204f41a525b1e72e460235d2ad33e9841448527863f9c

Canonical record JSON

{
  "metadata": {
    "abstract_canon_sha256": "f906965549c0a762f289deb8b942a9f61d2eee483234690cf181624b5bfbe757",
    "cross_cats_sorted": [
      "cs.CL"
    ],
    "license": "http://arxiv.org/licenses/nonexclusive-distrib/1.0/",
    "primary_cat": "cs.LG",
    "submitted_at": "2026-05-12T20:22:45Z",
    "title_canon_sha256": "7e7f8912f86f8380410ff61b54d97267a6f1980b1bfcb924e0619016bd4ebff8"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2605.12715",
    "kind": "arxiv",
    "version": 1
  }
}
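The shell one-liner in "Verify this Pith Number yourself" can also be run offline against the record printed above. The sketch below applies the same canonicalisation as that one-liner (sorted keys, compact separators, UTF-8, no ASCII escaping) and compares the digest with the canonical hash. It assumes the JSON shown here is byte-for-byte what the resolver serves, which this page does not guarantee. The name canonical_sha256 is hypothetical.

```python
import hashlib
import json

# Canonical hash as printed on this page.
EXPECTED = "077b6fb57c78538379a204f41a525b1e72e460235d2ad33e9841448527863f9c"

# Canonical record copied from this page (assumed identical to the served one).
RECORD_JSON = """
{
  "metadata": {
    "abstract_canon_sha256": "f906965549c0a762f289deb8b942a9f61d2eee483234690cf181624b5bfbe757",
    "cross_cats_sorted": ["cs.CL"],
    "license": "http://arxiv.org/licenses/nonexclusive-distrib/1.0/",
    "primary_cat": "cs.LG",
    "submitted_at": "2026-05-12T20:22:45Z",
    "title_canon_sha256": "7e7f8912f86f8380410ff61b54d97267a6f1980b1bfcb924e0619016bd4ebff8"
  },
  "schema_version": "1.0",
  "source": {"id": "2605.12715", "kind": "arxiv", "version": 1}
}
"""


def canonical_sha256(record: dict) -> str:
    """Serialise with sorted keys and compact separators, then hash as UTF-8."""
    payload = json.dumps(
        record, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    ).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


record = json.loads(RECORD_JSON)
digest = canonical_sha256(record)
print(digest)
print("matches expected:", digest == EXPECTED)
```

Sorting keys and fixing the separators makes the serialisation independent of key order and whitespace in the input, which is what lets any mirror recompute the same digest from the same record.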