{"paper":{"title":"DataComp-LM: In search of the next generation of training sets for language models","license":"http://creativecommons.org/licenses/by/4.0/","headline":"Model-based filtering of web text produces training sets that let 7B language models reach 64% MMLU with 2.6T tokens and 40% less compute than prior open models.","cross_cats":["cs.CL"],"primary_cat":"cs.LG","authors_text":"Aaron Gokaslan, Achal Dave, Alaaeldin El-Nouby, Alexander Toshev, Alexandros G. Dimakis, Alex Fang, Alon Albalak, Amro Abbas, Cheng-Yu Hsieh, Dhruba Ghosh, Dirk Groeneveld, Etash Guha, Fartash Faghri, Gabriel Ilharco, Georgios Smyrnis, Giannis Daras, Hadi Pouransari, Hanlin Zhang, Hritik Bansal, Igor Vasiljevic, Jean Mercat, Jeffrey Li, Jenia Jitsev, Jieyu Zhang, Josh Gardner, Kalyani Marathe, Khyathi Chandu, Kushal Arora, Kyle Lo, Luca Soldaini, Ludwig Schmidt, Luke Zettlemoyer, Maciej Kilian, Maor Ivgi, Marianna Nezhurina, Matt Jordan, Mayee Chen, Mitchell Wortsman, Niklas Muennighoff, Pang Wei Koh, Reinhard Heckel, Rui Xin, Rulin Shao, Samir Gadre, Sarah Pratt, Saurabh Garg, Sedrick Keh, Sewoong Oh, Sham Kakade, Shuran Song, Stephanie Wang, Suchin Gururangan, Sujay Sanghavi, Sunny Sanyal, Thao Nguyen, Thomas Kollar, Vaishaal Shankar, Yair Carmon, Yonatan Bitton","submitted_at":"2024-06-17T17:42:57Z","abstract_excerpt":"We introduce DataComp for Language Models (DCLM), a testbed for controlled dataset experiments with the goal of improving language models. As part of DCLM, we provide a standardized corpus of 240T tokens extracted from Common Crawl, effective pretraining recipes based on the OpenLM framework, and a broad suite of 53 downstream evaluations. Participants in the DCLM benchmark can experiment with data curation strategies such as deduplication, filtering, and data mixing at model scales ranging from 412M to 7B parameters. As a baseline for DCLM, we conduct extensive experiments and find that model"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"Model-based filtering is key to assembling a high-quality training set. The resulting DCLM-Baseline enables training a 7B parameter language model from scratch to 64% 5-shot accuracy on MMLU with 2.6T training tokens, representing a 6.6 percentage point improvement on MMLU over MAP-Neo while using 40% less compute.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"That the 53 downstream evaluations and the specific model-based filtering thresholds chosen in the experiments will generalize to other model scales, data sources, and future architectures without significant degradation.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"DCLM-Baseline dataset lets a 7B model reach 64% 5-shot MMLU accuracy after 2.6T tokens, beating prior open-data models by 6.6 points on MMLU with 40% less compute.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"Model-based filtering of web text produces training sets that let 7B language models reach 64% MMLU with 2.6T tokens and 40% less compute than prior open models.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"040f3be39f5e9e5e624f6c4730fae5711ffd143b9478423198dac7c2ee8e3cdf"},"source":{"id":"2406.11794","kind":"arxiv","version":4},"verdict":{"id":"6e9f8365-3d5d-463a-b9e1-d0e7f27fa9e9","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-17T22:53:23.901636Z","strongest_claim":"Model-based filtering is key to assembling a high-quality training set. The resulting DCLM-Baseline enables training a 7B parameter language model from scratch to 64% 5-shot accuracy on MMLU with 2.6T training tokens, representing a 6.6 percentage point improvement on MMLU over MAP-Neo while using 40% less compute.","one_line_summary":"DCLM-Baseline dataset lets a 7B model reach 64% 5-shot MMLU accuracy after 2.6T tokens, beating prior open-data models by 6.6 points on MMLU with 40% less compute.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"That the 53 downstream evaluations and the specific model-based filtering thresholds chosen in the experiments will generalize to other model scales, data sources, and future architectures without significant degradation.","pith_extraction_headline":"Model-based filtering of web text produces training sets that let 7B language models reach 64% MMLU with 2.6T tokens and 40% less compute than prior open models."},"references":{"count":252,"sample":[{"doi":"","year":2023,"title":"Semdedup: Data-efficient learning at web-scale through semantic deduplication, 2023","work_id":"492d4320-a8d4-4094-b226-ea8d784560d9","ref_index":1,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2024,"title":"Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone","work_id":"feef9556-a016-493c-abd2-0c97a23a7ebf","ref_index":2,"cited_arxiv_id":"2404.14219","is_internal_anchor":true},{"doi":"10.1145/1645953.1646283","year":2009,"title":"Leela, Krishna Prasad Chitrapura, Sachin Garg, Pavan Kumar GM, Chittaranjan Haty, Anirban Roy, and Amit Sasturkar","work_id":"37f8fecf-1c80-4318-9b35-454ab97c1b66","ref_index":3,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2024,"title":"Introducing meta llama 3: The most capable openly available llm to date, 2024","work_id":"ec9e7006-eba3-4265-aa1f-4541fa192264","ref_index":4,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2022,"title":"FETA: A benchmark for few-sample task transfer in open-domain dialogue","work_id":"68157660-8a73-4920-b691-4de04bb9d143","ref_index":5,"cited_arxiv_id":"","is_internal_anchor":false}],"resolved_work":252,"snapshot_sha256":"85e5fdf6b055ff018f6bf2d6e99a7a013ab8f57586dd668f01b421c452aa145e","internal_anchors":40},"formal_canon":{"evidence_count":2,"snapshot_sha256":"ee0ffd7d9263457ca4bc909f12d4210d76f8a0b72b7a203241c1b07c01ce42c7"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"}