pith:W52A4EPY
DataComp-LM: In search of the next generation of training sets for language models
Model-based filtering of web text produces training sets that let 7B language models reach 64% MMLU with 2.6T tokens and 40% less compute than prior open models.
arxiv:2406.11794 v4 · 2024-06-17 · cs.LG · cs.CL
Add to your LaTeX paper
\usepackage{pith}
\pithnumber{W52A4EPYQQI36QMMPBFIVGNXCH}
Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.
Claims
Model-based filtering is key to assembling a high-quality training set. The resulting DCLM-Baseline enables training a 7B parameter language model from scratch to 64% 5-shot accuracy on MMLU with 2.6T training tokens, representing a 6.6 percentage point improvement on MMLU over MAP-Neo while using 40% less compute.
The claim rests on the assumption that the 53 downstream evaluations and the specific model-based filtering thresholds chosen in the experiments will generalize to other model scales, data sources, and future architectures without significant degradation.
The DCLM-Baseline dataset lets a 7B model reach 64% 5-shot MMLU accuracy after 2.6T training tokens, 6.6 percentage points above MAP-Neo on MMLU with 40% less compute.
Receipt and verification
| First computed | 2026-05-17T23:38:12.734883Z |
|---|---|
| Builder | pith-number-builder-2026-05-17-v1 |
| Signature | Pith Ed25519 (pith-v1-2026-05) · public key |
| Schema | pith-number/v1.0 |
Canonical hash
b7740e11f88411bf418c784a8a99b711d3739f0773d12af22ea4c18a2933cac2
Verify this Pith Number yourself
curl -sH 'Accept: application/ld+json' https://pith.science/pith/W52A4EPYQQI36QMMPBFIVGNXCH \
| jq -c '.canonical_record' \
| python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: b7740e11f88411bf418c784a8a99b711d3739f0773d12af22ea4c18a2933cac2
Canonical record JSON
{
"metadata": {
"abstract_canon_sha256": "ad5121a91dcf1ca13069dbab084eaf0a9312b4f101cd63a2937f895438a109a8",
"cross_cats_sorted": [
"cs.CL"
],
"license": "http://creativecommons.org/licenses/by/4.0/",
"primary_cat": "cs.LG",
"submitted_at": "2024-06-17T17:42:57Z",
"title_canon_sha256": "ce78271561ec502f377344a9f72c3b6824c230e5956e388314b4d0c37604efa1"
},
"schema_version": "1.0",
"source": {
"id": "2406.11794",
"kind": "arxiv",
"version": 4
}
}
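The verification one-liner above re-serializes the canonical record with sorted keys and compact separators, then hashes the UTF-8 bytes with SHA-256. A minimal Python sketch of that same step, run offline against the record shown here, might look like the following. The function name `canonical_hash` and the variable names are hypothetical, chosen for illustration. Whether the digest matches the published hash depends on the record above being exactly what the service hashes, so the script prints the comparison rather than assuming the result.

```python
import hashlib
import json


def canonical_hash(record: dict) -> str:
    """SHA-256 hex digest of the record's canonical JSON form.

    Mirrors the shell one-liner: sorted keys, compact separators,
    non-ASCII characters kept as-is, UTF-8 encoding.
    """
    canonical = json.dumps(
        record,
        sort_keys=True,
        separators=(",", ":"),
        ensure_ascii=False,
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Canonical record copied from this page.
record = {
    "metadata": {
        "abstract_canon_sha256": "ad5121a91dcf1ca13069dbab084eaf0a9312b4f101cd63a2937f895438a109a8",
        "cross_cats_sorted": ["cs.CL"],
        "license": "http://creativecommons.org/licenses/by/4.0/",
        "primary_cat": "cs.LG",
        "submitted_at": "2024-06-17T17:42:57Z",
        "title_canon_sha256": "ce78271561ec502f377344a9f72c3b6824c230e5956e388314b4d0c37604efa1",
    },
    "schema_version": "1.0",
    "source": {"id": "2406.11794", "kind": "arxiv", "version": 4},
}

# Hash published on this page under "Canonical hash".
EXPECTED = "b7740e11f88411bf418c784a8a99b711d3739f0773d12af22ea4c18a2933cac2"

digest = canonical_hash(record)
print("computed:", digest)
print("matches published hash:", digest == EXPECTED)
```

Because keys are sorted before hashing, the digest does not depend on the order in which the JSON fields arrive. Any change to a value, such as the source version, produces a different digest.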