Pith Number

pith:W52A4EPY

pith:2024:W52A4EPYQQI36QMMPBFIVGNXCH
not attested · not anchored · not stored · refs resolved

DataComp-LM: In search of the next generation of training sets for language models

Aaron Gokaslan, Achal Dave, Alaaeldin El-Nouby, Alexander Toshev, Alexandros G. Dimakis, Alex Fang, Alon Albalak, Amro Abbas, Cheng-Yu Hsieh, Dhruba Ghosh, Dirk Groeneveld, Etash Guha, Fartash Faghri, Gabriel Ilharco, Georgios Smyrnis, Giannis Daras, Hadi Pouransari, Hanlin Zhang, Hritik Bansal, Igor Vasiljevic, Jean Mercat, Jeffrey Li, Jenia Jitsev, Jieyu Zhang, Josh Gardner, Kalyani Marathe, Khyathi Chandu, Kushal Arora, Kyle Lo, Luca Soldaini, Ludwig Schmidt, Luke Zettlemoyer, Maciej Kilian, Maor Ivgi, Marianna Nezhurina, Matt Jordan, Mayee Chen, Mitchell Wortsman, Niklas Muennighoff, Pang Wei Koh, Reinhard Heckel, Rui Xin, Rulin Shao, Samir Gadre, Sarah Pratt, Saurabh Garg, Sedrick Keh, Sewoong Oh, Sham Kakade, Shuran Song, Stephanie Wang, Suchin Gururangan, Sujay Sanghavi, Sunny Sanyal, Thao Nguyen, Thomas Kollar, Vaishaal Shankar, Yair Carmon, Yonatan Bitton

Model-based filtering of web text produces a training set (DCLM-Baseline) that lets a 7B language model reach 64% 5-shot MMLU accuracy after 2.6T training tokens, 6.6 percentage points above MAP-Neo with 40% less compute.

arxiv:2406.11794 v4 · 2024-06-17 · cs.LG · cs.CL

Add to your LaTeX paper
\usepackage{pith}
\pithnumber{W52A4EPYQQI36QMMPBFIVGNXCH}

Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.

Record completeness

1 Bitcoin timestamp
2 Internet Archive
3 Author claim open
4 Citations open
5 Replications open
Portable graph bundle live · download bundle · merged state
The bundle contains the canonical record plus signed events. A mirror can host it anywhere and recompute the same current state with the deterministic merge algorithm.

Claims

C1 · strongest claim

Model-based filtering is key to assembling a high-quality training set. The resulting DCLM-Baseline enables training a 7B parameter language model from scratch to 64% 5-shot accuracy on MMLU with 2.6T training tokens, representing a 6.6 percentage point improvement on MMLU over MAP-Neo while using 40% less compute.

C2 · weakest assumption

That the 53 downstream evaluations and the specific model-based filtering thresholds chosen in the experiments will generalize to other model scales, data sources, and future architectures without significant degradation.

C3 · one line summary

The DCLM-Baseline dataset lets a 7B model reach 64% 5-shot MMLU accuracy after 2.6T training tokens, beating MAP-Neo by 6.6 percentage points with 40% less compute.
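
The figures in the claims above can be sanity-checked with simple arithmetic. The sketch below assumes the common C ≈ 6·N·D rule of thumb for dense-transformer training FLOPs (N parameters, D tokens). That approximation, the nominal 7e9 parameter count, and the function name are assumptions for illustration; none of them comes from this record, and the 64% figure is rounded, so the implied baseline values are approximate.

```python
def approx_train_flops(n_params: float, n_tokens: float) -> float:
    """Rule-of-thumb training compute, C ~= 6 * N * D (assumed, not from the record)."""
    return 6.0 * n_params * n_tokens


# Figures stated in claim C1.
N_PARAMS = 7e9          # "7B" taken as a nominal parameter count (assumption)
N_TOKENS = 2.6e12       # 2.6T training tokens
MMLU_DCLM = 64.0        # 5-shot MMLU accuracy, percent
MMLU_GAIN_PP = 6.6      # percentage-point gain over MAP-Neo
COMPUTE_SAVING = 0.40   # "40% less compute"

dclm_flops = approx_train_flops(N_PARAMS, N_TOKENS)

# "40% less compute" means the DCLM run used 60% of the baseline's compute.
implied_baseline_flops = dclm_flops / (1.0 - COMPUTE_SAVING)

# A 6.6-point gain implies the baseline MMLU score.
implied_baseline_mmlu = MMLU_DCLM - MMLU_GAIN_PP

print(f"DCLM-Baseline 7B compute (approx): {dclm_flops:.3e} FLOPs")
print(f"Implied baseline compute (approx): {implied_baseline_flops:.3e} FLOPs")
print(f"Implied baseline MMLU (approx):    {implied_baseline_mmlu:.1f}%")
```

The point of the sketch is only to show how the three numbers in the claim relate: the token count fixes the compute estimate, and the two relative statements fix the implied baseline.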

References

252 extracted · 252 resolved · 40 Pith anchors


Formal links

2 machine-checked theorem links

Cited by

20 papers in Pith

Receipt and verification
First computed 2026-05-17T23:38:12.734883Z
Builder pith-number-builder-2026-05-17-v1
Signature Pith Ed25519 (pith-v1-2026-05) · public key
Schema pith-number/v1.0

Canonical hash

b7740e11f88411bf418c784a8a99b711d3739f0773d12af22ea4c18a2933cac2

Aliases

arxiv: 2406.11794 · arxiv_version: 2406.11794v4 · doi: 10.48550/arxiv.2406.11794 · pith_short_12: W52A4EPYQQI3 · pith_short_16: W52A4EPYQQI36QMM · pith_short_8: W52A4EPY
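
The identifiers on this page appear to be related in a simple way. For this record, Base32-encoding (RFC 4648 alphabet) the 32 bytes of the canonical hash and keeping the first 26 characters reproduces the Pith Number, and each pith_short_N alias is its first N characters. The sketch below checks that relationship. It is an observation about this one record, not a statement of the official scheme, and the function names are hypothetical.

```python
import base64

# Values copied from this record.
CANONICAL_HASH = "b7740e11f88411bf418c784a8a99b711d3739f0773d12af22ea4c18a2933cac2"
PITH_NUMBER = "W52A4EPYQQI36QMMPBFIVGNXCH"
ALIASES = {8: "W52A4EPY", 12: "W52A4EPYQQI3", 16: "W52A4EPYQQI36QMM"}


def pith_from_hash(hex_digest: str, length: int = 26) -> str:
    """Base32-encode the raw digest bytes and keep a prefix (assumed scheme)."""
    raw = bytes.fromhex(hex_digest)
    return base64.b32encode(raw).decode("ascii")[:length]


def short_alias(pith_number: str, n: int) -> str:
    """Return the first n characters, matching the pith_short_N aliases here."""
    return pith_number[:n]


derived = pith_from_hash(CANONICAL_HASH)
print("full number matches:", derived == PITH_NUMBER)
for n, alias in ALIASES.items():
    print(f"pith_short_{n} matches:", short_alias(derived, n) == alias)
```

If the relationship holds generally, a short alias can be checked against a canonical hash without contacting the service; treat that as a convenience check rather than a substitute for signature verification.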

Agent API

Verify this Pith Number yourself
curl -sH 'Accept: application/ld+json' https://pith.science/pith/W52A4EPYQQI36QMMPBFIVGNXCH \
  | jq -c '.canonical_record' \
  | python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: b7740e11f88411bf418c784a8a99b711d3739f0773d12af22ea4c18a2933cac2

Canonical record JSON
{
  "metadata": {
    "abstract_canon_sha256": "ad5121a91dcf1ca13069dbab084eaf0a9312b4f101cd63a2937f895438a109a8",
    "cross_cats_sorted": [
      "cs.CL"
    ],
    "license": "http://creativecommons.org/licenses/by/4.0/",
    "primary_cat": "cs.LG",
    "submitted_at": "2024-06-17T17:42:57Z",
    "title_canon_sha256": "ce78271561ec502f377344a9f72c3b6824c230e5956e388314b4d0c37604efa1"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2406.11794",
    "kind": "arxiv",
    "version": 4
  }
}
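
The curl pipeline under "Verify this Pith Number yourself" can also be reproduced as a small Python script, which makes the canonicalization step explicit: keys sorted, compact separators, no ASCII escaping, UTF-8 bytes, then SHA-256. The sketch below mirrors the one-liner and runs offline against the record printed above. The function names are hypothetical, and `fetch_canonical_record` assumes the endpoint returns a JSON object with a `canonical_record` key, as the `jq` filter implies.

```python
import hashlib
import json
import urllib.request


def canonical_sha256(record: dict) -> str:
    """Hash the record using the same serialization as the shell one-liner."""
    payload = json.dumps(
        record, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    ).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def fetch_canonical_record(url: str) -> dict:
    """Fetch the JSON-LD document and return its canonical_record (needs network)."""
    req = urllib.request.Request(url, headers={"Accept": "application/ld+json"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["canonical_record"]


# The canonical record as printed on this page.
RECORD = {
    "metadata": {
        "abstract_canon_sha256": "ad5121a91dcf1ca13069dbab084eaf0a9312b4f101cd63a2937f895438a109a8",
        "cross_cats_sorted": ["cs.CL"],
        "license": "http://creativecommons.org/licenses/by/4.0/",
        "primary_cat": "cs.LG",
        "submitted_at": "2024-06-17T17:42:57Z",
        "title_canon_sha256": "ce78271561ec502f377344a9f72c3b6824c230e5956e388314b4d0c37604efa1",
    },
    "schema_version": "1.0",
    "source": {"id": "2406.11794", "kind": "arxiv", "version": 4},
}

digest = canonical_sha256(RECORD)
print(digest)
```

Compare the printed digest with the value listed under Canonical hash. Because the keys are sorted before hashing, the order in which a mirror stores the fields does not affect the result.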