arxiv: 2406.11794 · v4 · pith:W52A4EPYnew · submitted 2024-06-17 · 💻 cs.LG · cs.CL

DataComp-LM: In search of the next generation of training sets for language models

Jeffrey Li , Alex Fang , Georgios Smyrnis , Maor Ivgi , Matt Jordan , Samir Gadre , Hritik Bansal , Etash Guha


Sedrick Keh Kushal Arora Saurabh Garg Rui Xin Niklas Muennighoff Reinhard Heckel Jean Mercat Mayee Chen Suchin Gururangan Mitchell Wortsman Alon Albalak Yonatan Bitton Marianna Nezhurina Amro Abbas Cheng-Yu Hsieh Dhruba Ghosh Josh Gardner Maciej Kilian Hanlin Zhang Rulin Shao Sarah Pratt Sunny Sanyal Gabriel Ilharco Giannis Daras Kalyani Marathe Aaron Gokaslan Jieyu Zhang Khyathi Chandu Thao Nguyen Igor Vasiljevic Sham Kakade Shuran Song Sujay Sanghavi Fartash Faghri Sewoong Oh Luke Zettlemoyer Kyle Lo Alaaeldin El-Nouby Hadi Pouransari Alexander Toshev Stephanie Wang Dirk Groeneveld Luca Soldaini Pang Wei Koh Jenia Jitsev Thomas Kollar Alexandros G. Dimakis Yair Carmon Achal Dave Ludwig Schmidt Vaishaal Shankar


Pith reviewed 2026-05-17 22:53 UTC · model grok-4.3


keywords: language model pretraining, data curation, model-based filtering, Common Crawl, dataset benchmark, open language models, training data quality


The pith

Model-based filtering of web text produces training sets that let 7B language models reach 64% MMLU with 2.6T tokens and 40% less compute than prior open models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces DataComp for Language Models (DCLM), a controlled testbed with a 240T-token Common Crawl corpus, standard training recipes, and 53 downstream evaluations. Experiments across model sizes from 412M to 7B parameters show that filtering data with a smaller model to retain only high-quality documents is the decisive curation step. The resulting DCLM-Baseline dataset trains a 7B model from scratch to 64% 5-shot MMLU accuracy. This beats the previous open-data leader, MAP-Neo, by 6.6 points while using 40% less compute, and it performs similarly to Mistral-7B-v0.3 and Llama 3 8B on the average of the 53 tasks.

Core claim

Model-based filtering is the key mechanism for assembling high-quality pretraining data. Applied to a large Common Crawl extract, it yields DCLM-Baseline, which supports training a 7B language model to 64% 5-shot accuracy on MMLU using 2.6T tokens. The same model improves on MAP-Neo by 6.6 percentage points on MMLU, is comparable to Mistral-7B-v0.3 and Llama 3 8B on that benchmark (63% and 66%), and performs similarly on the average of 53 natural language understanding tasks while requiring 6.6 times less compute than Llama 3 8B.

What carries the argument

Model-based filtering, which trains a smaller classifier on high-quality seed data and then scores and retains only the top documents from the full corpus.
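The score-and-keep loop described above can be sketched in a few lines. This is a minimal illustration, not the paper's classifier: it swaps in a bag-of-words naive Bayes scorer, and the seed documents, the toy corpus, the function names, and the `keep_fraction` value are all made up for the example.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenizer; a stand-in for a real tokenizer."""
    return text.lower().split()


def train_quality_scorer(positive_docs, negative_docs):
    """Fit a bag-of-words naive Bayes scorer with add-one smoothing.

    Returns a function that maps a document to a length-normalized
    log-odds score; higher means closer to the positive seed set.
    """
    pos_counts = Counter(t for d in positive_docs for t in tokenize(d))
    neg_counts = Counter(t for d in negative_docs for t in tokenize(d))
    vocab = set(pos_counts) | set(neg_counts)
    pos_total = sum(pos_counts.values()) + len(vocab)
    neg_total = sum(neg_counts.values()) + len(vocab)

    def score(doc):
        # Tokens never seen in either seed set carry no evidence.
        tokens = [t for t in tokenize(doc) if t in vocab]
        if not tokens:
            return 0.0
        log_odds = sum(
            math.log((pos_counts[t] + 1) / pos_total)
            - math.log((neg_counts[t] + 1) / neg_total)
            for t in tokens
        )
        return log_odds / len(tokens)

    return score


def filter_top_fraction(docs, score, keep_fraction):
    """Keep the highest-scoring `keep_fraction` of docs (at least one)."""
    n_keep = max(1, int(len(docs) * keep_fraction))
    return sorted(docs, key=score, reverse=True)[:n_keep]


# Toy, made-up data for illustration only.
seed_good = [
    "the theorem follows from the lemma by induction",
    "we measure accuracy on a held-out evaluation set",
]
seed_bad = [
    "click here to win a free prize now",
    "buy cheap pills online free shipping click now",
]
corpus = [
    "click now for a free prize",
    "the proof proceeds by induction on the set size",
    "free shipping on cheap pills",
    "accuracy is measured on a held-out set",
]

scorer = train_quality_scorer(seed_good, seed_bad)
kept = filter_top_fraction(corpus, scorer, keep_fraction=0.5)
```

In a real pipeline the scorer would be a trained classifier and the retained fraction a tuned threshold; the page itself flags those thresholds as not fully specified.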

If this is right

Systematic comparison of data strategies becomes possible at multiple scales using the shared corpus and evaluation suite.
High-quality filtered data measurably lowers the compute needed to reach competitive performance on standard benchmarks.
Models trained on open data can now approach some 7-8B models whose training data is not public, both on MMLU and on the broader 53-task average.
Further gains from deduplication, mixing ratios, or alternative filters can be measured directly against the same baseline.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Data quality choices may offer a more immediate performance lever than additional scale in the current 1-7B regime.
A broad multi-task evaluation suite gives a more stable signal for data experiments than reliance on MMLU alone.
The same filtering recipe could be tested on non-English or domain-specific corpora to measure cross-domain transfer.
Re-running the baseline with a different base model family would reveal whether the gains depend on the filter model architecture.

Load-bearing premise

The particular filtering thresholds and the 53-task evaluation suite chosen here will keep producing strong results when the same method is applied at other model scales, to new data sources, or with future architectures.

What would settle it

Train a 7B model on the identical 240T-token corpus but without the model-based filter step and check whether 5-shot MMLU accuracy falls below 58% or requires substantially more than 4T tokens to recover the 64% mark.

Original abstract

We introduce DataComp for Language Models (DCLM), a testbed for controlled dataset experiments with the goal of improving language models. As part of DCLM, we provide a standardized corpus of 240T tokens extracted from Common Crawl, effective pretraining recipes based on the OpenLM framework, and a broad suite of 53 downstream evaluations. Participants in the DCLM benchmark can experiment with data curation strategies such as deduplication, filtering, and data mixing at model scales ranging from 412M to 7B parameters. As a baseline for DCLM, we conduct extensive experiments and find that model-based filtering is key to assembling a high-quality training set. The resulting dataset, DCLM-Baseline, enables training a 7B parameter language model from scratch to 64% 5-shot accuracy on MMLU with 2.6T training tokens. Compared to MAP-Neo, the previous state-of-the-art in open-data language models, DCLM-Baseline represents a 6.6 percentage point improvement on MMLU while being trained with 40% less compute. Our baseline model is also comparable to Mistral-7B-v0.3 and Llama 3 8B on MMLU (63% & 66%), and performs similarly on an average of 53 natural language understanding tasks while being trained with 6.6x less compute than Llama 3 8B. Our results highlight the importance of dataset design for training language models and offer a starting point for further research on data curation.
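The compute comparisons in the abstract can be sanity-checked with a back-of-the-envelope estimate. The sketch below uses the common rule of thumb that training compute is roughly 6 FLOPs per parameter per token; this page does not confirm that the paper measures compute this way. The 7B and 2.6T figures come from the abstract, while the token counts for Llama 3 8B and MAP-Neo are assumptions supplied for illustration and do not appear on this page.

```python
def approx_train_flops(n_params, n_tokens):
    """Rule-of-thumb training compute: C ~ 6 * N * D FLOPs."""
    return 6.0 * n_params * n_tokens


# From the abstract: a 7B-parameter model trained on 2.6T tokens.
dclm_flops = approx_train_flops(7e9, 2.6e12)

# ASSUMED inputs (not stated on this page): training token counts
# for the two comparison models.
llama3_8b_flops = approx_train_flops(8e9, 15e12)
map_neo_7b_flops = approx_train_flops(7e9, 4.5e12)

# "N times less compute" is read here as a ratio of total FLOPs.
times_less_than_llama3 = llama3_8b_flops / dclm_flops

# "X% less compute" is read here as a fractional saving in FLOPs.
fraction_saved_vs_map_neo = 1.0 - dclm_flops / map_neo_7b_flops

print(f"DCLM-Baseline 7B: {dclm_flops:.3e} FLOPs")
print(f"vs Llama 3 8B: {times_less_than_llama3:.1f}x less")
print(f"vs MAP-Neo: {fraction_saved_vs_map_neo:.0%} less")
```

Under these assumed inputs the ratios land near the quoted 6.6x and 40% figures, which suggests, but does not establish, that the abstract's compute metric is total training FLOPs.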

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This paper sets up a practical testbed for data curation experiments in LM pretraining and delivers a baseline that beats prior open models on MMLU with less compute, though the gains need tighter isolation from other pipeline changes.


The main thing to know is that this work creates a standardized testbed for data curation methods in language model pretraining and provides a strong baseline dataset that outperforms previous open efforts with less compute. The authors extract 240T tokens from Common Crawl, supply pretraining recipes built on OpenLM, and include 53 downstream evaluations. Their DCLM-Baseline relies on model-based filtering to train a 7B model to 64% on 5-shot MMLU using 2.6T tokens. This beats MAP-Neo by 6.6 points on MMLU while using 40% less compute. It also comes close to Mistral-7B-v0.3 and Llama 3 8B on MMLU and performs similarly on the 53-task average, while using 6.6x less compute than the latter. The multi-scale experiments from 412M to 7B add some robustness to the findings.

The paper does a good job releasing resources that let others run controlled experiments on filtering, deduplication, and mixing. That is practical for the field.

A potential soft spot is whether the gains are cleanly attributable to model-based filtering. The concern holds if the baseline also changes other factors, such as total tokens or mixing ratios, at the same time. To say with confidence that filtering is key, the ablations should compare the same setup with only the filter type varied. If those controlled contrasts are in the paper, the claim holds up better; otherwise it is a bit loose. Since this note is based on the abstract, the full text would settle the question.

This paper is for researchers focused on data quality and efficiency in pretraining language models. People who want to try new curation ideas or compare against a solid open baseline will find it useful. It has enough new infrastructure and reported results to warrant a serious referee. I would recommend sending it for peer review.

Referee Report

1 major / 3 minor

Summary. The paper introduces DataComp-LM (DCLM), a testbed and benchmark for controlled experiments on data curation strategies (deduplication, filtering, mixing) for language model pretraining. It releases a standardized 240T-token corpus from Common Crawl, OpenLM-based training recipes, and a suite of 53 downstream evaluations spanning scales from 412M to 7B parameters. The central empirical result is that model-based filtering is the key ingredient for high-quality training sets; the resulting DCLM-Baseline enables a 7B model trained on 2.6T tokens to reach 64% 5-shot MMLU accuracy, a 6.6-point gain over MAP-Neo with 40% less compute, while remaining competitive with Mistral-7B-v0.3 and Llama 3 8B on MMLU and the average of the 53 tasks.

Significance. If the results hold under controlled conditions, the work is significant for establishing a reproducible platform that shifts emphasis toward data-centric methods in LLM training. The multi-scale experiments, broad evaluation suite, and open release of the corpus and recipes provide concrete value for the community and support further research on dataset design. The reported performance deltas illustrate that careful filtering can deliver gains comparable to those from additional compute or scale.

major comments (1)

[§4] §4 and associated experimental tables: The claim that 'model-based filtering is key' and directly produces the 6.6-point MMLU gain requires isolation of the filtering variable. The DCLM-Baseline differs from the MAP-Neo reference in data volume (2.6T tokens), deduplication, and mixing ratios in addition to the filter itself. A controlled contrast that fixes the underlying corpus, total token count, and all other pipeline steps while varying only the filter (heuristic vs. model-based) is needed to support causal attribution of the gains.
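The controlled contrast requested here amounts to a single-factor ablation grid. The sketch below shows one way to generate such a grid and verify that only the filter differs between runs. The `RunConfig` fields, their values, and the helper names are hypothetical and are not taken from the DCLM codebase.

```python
from dataclasses import asdict, dataclass, replace


@dataclass(frozen=True)
class RunConfig:
    """Hypothetical description of one training run."""
    corpus: str
    token_budget: int
    dedup: str
    mixing: str
    filter_type: str


def single_factor_ablation(base, factor, values):
    """Return one config per value, changing only `factor`."""
    return [replace(base, **{factor: value}) for value in values]


def differing_fields(a, b):
    """Names of the fields on which two configs disagree."""
    da, db = asdict(a), asdict(b)
    return {name for name in da if da[name] != db[name]}


# Placeholder values; everything except the filter is held fixed.
base = RunConfig(
    corpus="shared-pool",
    token_budget=1_000_000,
    dedup="fixed-dedup",
    mixing="fixed-mix",
    filter_type="none",
)
runs = single_factor_ablation(base, "filter_type", ["heuristic", "model-based"])

# Guard against accidental confounds before launching any training.
for run in runs:
    assert differing_fields(base, run) == {"filter_type"}
```

A check like `differing_fields` makes the isolation explicit: a run that also changes the token budget or the mixing ratios would fail the assertion before any compute is spent.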

minor comments (3)

The precise model-based filtering thresholds, classifier details, and exclusion rules are referenced but not fully specified in the main text; including pseudocode or a dedicated appendix subsection would improve reproducibility.
[Experimental tables] Performance tables would benefit from reporting variance or results across multiple random seeds to allow assessment of the robustness of the reported improvements.
[Abstract] The abstract states '40% less compute' without defining the metric (e.g., total FLOPs or wall-clock GPU hours); adding this clarification would aid direct comparisons.
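The seed-variance request in the second minor comment can be illustrated with a short sketch that expresses the gap between two configurations in units of their seed-to-seed spread. The accuracy values below are invented for the example and are not results from the paper; the function names are likewise illustrative.

```python
import statistics


def summarize(scores):
    """Mean and sample standard deviation across random seeds."""
    return statistics.mean(scores), statistics.stdev(scores)


def gap_in_pooled_std(baseline_scores, candidate_scores):
    """Mean improvement divided by the pooled per-seed standard deviation."""
    mean_a, sd_a = summarize(baseline_scores)
    mean_b, sd_b = summarize(candidate_scores)
    pooled = ((sd_a ** 2 + sd_b ** 2) / 2) ** 0.5
    if pooled == 0:
        # No seed-to-seed variation was observed; the ratio is undefined.
        return float("inf") if mean_b != mean_a else 0.0
    return (mean_b - mean_a) / pooled


# Made-up accuracies (percent) from three seeds per configuration.
heuristic_filter = [50.0, 51.0, 49.0]
model_based_filter = [54.0, 55.0, 53.0]

gap = gap_in_pooled_std(heuristic_filter, model_based_filter)
print(f"improvement = {gap:.1f} pooled standard deviations")
```

Reporting the per-seed spread alongside each table entry would let readers judge whether a reported improvement is large relative to run-to-run noise.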

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive review and for highlighting the value of DataComp-LM as a reproducible testbed. We address the major comment below and will revise the manuscript to clarify the scope of our claims and the controlled nature of our ablations.


Referee: [§4] §4 and associated experimental tables: The claim that 'model-based filtering is key' and directly produces the 6.6-point MMLU gain requires isolation of the filtering variable. The DCLM-Baseline differs from the MAP-Neo reference in data volume (2.6T tokens), deduplication, and mixing ratios in addition to the filter itself. A controlled contrast that fixes the underlying corpus, total token count, and all other pipeline steps while varying only the filter (heuristic vs. model-based) is needed to support causal attribution of the gains.

Authors: We agree that a fully isolated comparison strengthens causal attribution and appreciate the referee pointing this out. Our manuscript already reports controlled experiments in §4 that fix the underlying DCLM corpus, total token count, deduplication, and mixing ratios while varying only the filtering strategy (heuristic vs. model-based). These ablations demonstrate that model-based filtering is the key driver of performance gains within our testbed. The DCLM-Baseline vs. MAP-Neo comparison is presented as an end-to-end benchmark of our full pipeline against prior open-data work rather than an isolated ablation of the filter. We will revise §4 to explicitly distinguish the internal controlled contrasts from the external benchmark comparison and to qualify that the 6.6-point MMLU gain reflects the cumulative pipeline (including but not limited to model-based filtering).

revision: yes

Circularity Check

0 steps flagged

No circularity in empirical dataset curation results


The paper introduces an empirical benchmark (DCLM) and reports performance numbers from training runs on curated data subsets. Central claims rest on direct measurements of downstream accuracy (e.g., MMLU scores) using held-out evaluations rather than any derived quantities, fitted parameters renamed as predictions, or self-citation chains that substitute for independent justification. No equations, uniqueness theorems, or ansatzes are invoked whose validity reduces to the paper's own inputs by construction; results are therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The work is primarily empirical and relies on standard assumptions in language model pretraining such as the validity of next-token prediction and the representativeness of the chosen downstream tasks. No new theoretical axioms or invented entities are introduced in the abstract.

axioms (1)

domain assumption: Next-token prediction on filtered web text produces useful general capabilities.
Implicit in the choice to pretrain on the curated Common Crawl corpus and evaluate on MMLU and other NLU tasks.



Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Foundation.DimensionForcing alexander_duality_circle_linking unclear


unclear
Relation between the paper passage and the cited Recognition theorem.

we conduct extensive experiments and find that model-based filtering is key

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 17 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Scratchpad Patching: Decoupling Compute from Patch Size in Byte-Level Language Models
cs.CL 2026-05 conditional novelty 7.0

Scratchpad Patching decouples compute from patch size in byte-level language models by inserting entropy-triggered scratchpads to update patch context dynamically.

Projection-Free Transformers via Gaussian Kernel Attention
cs.LG 2026-05 unverdicted novelty 7.0

Gaussian Kernel Attention replaces learned QKV projections with a Gaussian RBF kernel on per-head token features, using 0.42x parameters and 0.49x FLOPs while showing competitive language modeling performance at depth 20.

Expert Upcycling: Shifting the Compute-Efficient Frontier of Mixture-of-Experts
cs.LG 2026-04 unverdicted novelty 7.0

Expert upcycling duplicates experts in an existing MoE checkpoint and continues pre-training to match fixed-size baseline performance with 32% less compute.

Synthetic Pre-Pre-Training Improves Language Model Robustness to Noisy Pre-Training Data
cs.CL 2026-05 unverdicted novelty 6.0

Synthetic pre-pre-training on structured data improves LLM robustness to noisy pre-training, matching baseline loss with up to 49% fewer natural tokens for a 1B model.

ZAYA1-8B Technical Report
cs.AI 2026-05 unverdicted novelty 6.0

ZAYA1-8B is a reasoning MoE model with 700M active parameters that matches larger models on math and coding benchmarks and reaches 91.9% on AIME'25 via Markovian RSA test-time compute.

Bridging Generation and Training: A Systematic Review of Quality Issues in LLMs for Code
cs.SE 2026-05 accept novelty 6.0

A review of 114 studies creates taxonomies for code and data quality issues, formalizes 18 propagation mechanisms from training data defects to LLM-generated code defects, and synthesizes detection and mitigation techniques.

NH-CROP: Robust Pricing for Governed Language Data Assets under Cost Uncertainty
cs.AI 2026-05 unverdicted novelty 6.0

NH-CROP introduces a robust online pricing method for governed language data with uncertain costs, using a selective verification gate that improves or matches baselines without relying heavily on paid information acq...

Compute Optimal Tokenization
cs.CL 2026-05 unverdicted novelty 6.0

Compute-optimal language models require parameter count to scale with data bytes rather than tokens, with optimal token compression rate decreasing as compute budget grows.

Beyond N-gram: Data-Aware X-GRAM Extraction for Efficient Embedding Parameter Scaling
cs.CL 2026-04 unverdicted novelty 6.0

X-GRAM applies data-aware dynamic token injection with hybrid hashing and local feature extraction to achieve up to 4.4 accuracy point gains over vanilla backbones and 3.2 over retrieval baselines at 0.73B-1.15B scale...

Expert Upcycling: Shifting the Compute-Efficient Frontier of Mixture-of-Experts
cs.LG 2026-04 unverdicted novelty 6.0

Expert upcycling expands MoE models by duplicating experts and continuing pre-training, matching baseline performance while saving 32% GPU hours in 7B-13B experiments.

Parcae: Scaling Laws For Stable Looped Language Models
cs.LG 2026-04 unverdicted novelty 6.0

Parcae stabilizes looped LLMs via spectral norm constraints on injection parameters, enabling power-law scaling for training FLOPs and saturating exponential scaling at test time that improves quality over fixed-depth...

Attention Editing: A Versatile Framework for Cross-Architecture Attention Conversion
cs.CL 2026-04 conditional novelty 6.0

Attention Editing converts pre-trained LLMs to new attention architectures through layer-wise teacher-forced optimization and model-level distillation, preserving performance with efficiency gains.

CoFrGeNet: Continued Fraction Architectures for Language Generation
cs.CL 2026-01 unverdicted novelty 6.0

CoFrGeNet uses continued-fraction function classes to build transformer replacements that match or beat GPT-2 and Llama performance with half to two-thirds the parameters.

Muon is Scalable for LLM Training
cs.LG 2025-02 unverdicted novelty 6.0

Muon optimizer with weight decay and update scaling achieves ~2x efficiency over AdamW for large LLMs, shown via the Moonlight 3B/16B MoE model trained on 5.7T tokens.

Matrix: Peer-to-Peer Multi-Agent Synthetic Data Generation Framework
cs.CL 2025-11 unverdicted novelty 5.0

Matrix provides a peer-to-peer multi-agent system for synthetic data generation that scales to tens of thousands of workflows and delivers 2-15x higher throughput than centralized designs without quality loss.

SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model
cs.CL 2025-02 unverdicted novelty 5.0

SmolLM2 is a 1.7B-parameter language model that outperforms Qwen2.5-1.5B and Llama3.2-1B after overtraining on 11 trillion tokens using custom FineMath, Stack-Edu, and SmolTalk datasets in a multi-stage pipeline.

XekRung Technical Report
cs.CR 2026-04 unverdicted novelty 3.0

XekRung achieves state-of-the-art performance on cybersecurity benchmarks among same-scale models via tailored data synthesis and multi-stage training while retaining strong general capabilities.

Reference graph

Works this paper leans on

252 extracted references · 252 canonical work pages · cited by 16 Pith papers · 42 internal anchors

[1]

SemDeDup: Data-efficient learning at web-scale through semantic deduplication

Amro Abbas, Kushal Tirumala, Dániel Simig, Surya Ganguli, and Ari S Morcos. Semdedup: Data-efficient learning at web-scale through semantic deduplication, 2023. URL https: //arxiv.org/abs/2303.09540

work page internal anchor Pith review arXiv 2023
[2]

Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone

Marah Abdin, Sam Ade Jacobs, Ammar Ahmad Awan, Jyoti Aneja, Ahmed Awadallah, Hany Awadalla, Nguyen Bach, Amit Bahree, Arash Bakhtiari, Harkirat Behl, et al. Phi-3 technical report: A highly capable language model locally on your phone. ArXiv preprint, abs/2404.14219, 2024. URL https://arxiv.org/abs/2404.14219

work page internal anchor Pith review Pith/arXiv arXiv 2024
[3]

Leela, Krishna Prasad Chitrapura, Sachin Garg, Pavan Kumar GM, Chittaranjan Haty, Anirban Roy, and Amit Sasturkar

Amit Agarwal, Hema Swetha Koppula, Krishna P. Leela, Krishna Prasad Chitrapura, Sachin Garg, Pavan Kumar GM, Chittaranjan Haty, Anirban Roy, and Amit Sasturkar. Url normalization for de-duplication of web pages. In ACM Conference on Information and Knowledge Management, 2009. https://doi.org/10.1145/1645953.1646283

work page doi:10.1145/1645953.1646283 2009
[4]

Introducing meta llama 3: The most capable openly available llm to date, 2024

Meta AI. Introducing meta llama 3: The most capable openly available llm to date, 2024. https://ai.meta.com/blog/meta-llama-3/

work page 2024
[5]

FETA: A benchmark for few-sample task transfer in open-domain dialogue

Alon Albalak, Yi-Lin Tuan, Pegah Jandaghi, Connor Pryor, Luke Yoffe, Deepak Ramachandran, Lise Getoor, Jay Pujara, and William Yang Wang. FETA: A benchmark for few-sample task transfer in open-domain dialogue. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing , pp. 10936–10953, Abu Dhabi, United Arab Emirates, 2022....

work page 2022
[6]

Efficient online data mixing for language model pre-training

Alon Albalak, Liangming Pan, Colin Raffel, and William Yang Wang. Efficient online data mixing for language model pre-training. ArXiv preprint, abs/2312.02406, 2023. URL https://arxiv.org/abs/2312.02406

work page arXiv 2023
[7]

Improving few-shot generalization by exploring and exploiting auxiliary data

Alon Albalak, Colin Raffel, and William Yang Wang. Improving few-shot generalization by exploring and exploiting auxiliary data. In Advances in Neural Information Processing Systems (NeurIPS), 2023. https://openreview.net/forum?id=JDnLXc4NOn

work page 2023
[8]

A survey on 12 data selection for language models

Alon Albalak, Yanai Elazar, Sang Michael Xie, Shayne Longpre, Nathan Lambert, Xinyi Wang, Niklas Muennighoff, Bairu Hou, Liangming Pan, Haewon Jeong, et al. A survey on 12 data selection for language models. ArXiv preprint, abs/2402.16827, 2024. URL https: //arxiv.org/abs/2402.16827

work page arXiv 2024
[9]

Santacoder: Don’t reach for the stars!

Loubna Ben Allal, Raymond Li, Denis Kocetkov, Chenghao Mou, Christopher Akiki, Carlos Munoz Ferrandis, Niklas Muennighoff, Mayank Mishra, Alex Gu, Manan Dey, et al. Santacoder: don’t reach for the stars! ArXiv preprint, abs/2301.03988, 2023. URL https://arxiv.org/abs/2301.03988

work page arXiv 2023
[10]

The Falcon Series of Open Language Models

Ebtesam Almazrouei, Hamza Alobeidli, Abdulaziz Alshamsi, Alessandro Cappelli, Ruxandra- Aimée Cojocaru, Daniel Hesslow, Julien Launay, Quentin Malartic, Daniele Mazzotta, Badreddine Noune, Baptiste Pannier, and Guilherme Penedo. The falcon series of open language models. arXiv preprint arXiv:2311.16867, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[11]

M ath QA : Towards interpretable math word problem solving with operation-based formalisms

Aida Amini, Saadia Gabriel, Shanchuan Lin, Rik Koncel-Kedziorski, Yejin Choi, and Hannaneh Hajishirzi. MathQA: Towards interpretable math word problem solving with operation-based formalisms. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and...

work page doi:10.18653/v1/n19-1245 2019
[12]

Leavitt, and Mansheej Paul

Zachary Ankner, Cody Blakeney, Kartik Sreenivasan, Max Marion, Matthew L. Leavitt, and Mansheej Paul. Perplexed by perplexity: Perplexity-based data pruning with small reference models. ArXiv preprint, abs/2405.20541, 2024. URL https://arxiv.org/abs/2405. 20541

work page arXiv 2024
[13]

Pytorch 2: Faster machine learning through dynamic python bytecode transformation and graph compilation

Jason Ansel, Edward Yang, Horace He, Natalia Gimelshein, Animesh Jain, Michael V oznesensky, Bin Bao, David Berard, Geeta Chauhan, Anjali Chourdia, Will Constable, Alban Desmaison, Zachary DeVito, Elias Ellison, Will Feng, Jiong Gong, Michael Gschwind, Brian Hirsh, Sherlock Huang, Laurent Kirsch, Michael Lazos, Yanbo Liang, Jason Liang, Yinghai Lu, CK Luk...

work page 2024
[14]

Llemma: An open language model for mathematics

Zhangir Azerbayev, Hailey Schoelkopf, Keiran Paster, Marco Dos Santos, Stephen McAleer, Albert Q Jiang, Jia Deng, Stella Biderman, and Sean Welleck. Llemma: An open language model for mathematics. ArXiv preprint, abs/2310.10631, 2023. URL https://arxiv.org/ abs/2310.10631

work page arXiv 2023
[15]

Layer Normalization

Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization.ArXiv preprint, abs/1607.06450, 2016. URL https://arxiv.org/abs/1607.06450

work page internal anchor Pith review Pith/arXiv arXiv 2016
[16]

Comparing bad apples to good oranges: Aligning large language models via joint preference optimization

Hritik Bansal, Ashima Suvarna, Gantavya Bhatt, Nanyun Peng, Kai-Wei Chang, and Aditya Grover. Comparing bad apples to good oranges: Aligning large language models via joint preference optimization. arXiv preprint arXiv:2404.00530, 2024

work page arXiv 2024
[17]

Trafilatura: A web scraping library and command-line tool for text discovery and extraction

Adrien Barbaresi. Trafilatura: A web scraping library and command-line tool for text discovery and extraction. InProceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing: System Demonstrations, pp. 122–131, Online, 2021. Association for Computational ...

work page doi:10.18653/v1/2021.acl-demo.15 2021
[18]

Beyond the imitation game: Quantifying and extrapolating the capabilities of language models

BIG bench authors. Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. In Transactions on Machine Learning Research (TMLR), 2023. https: //openreview.net/forum?id=uyTL5Bvosj

work page 2023
[19]

Bender, Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell

Emily M. Bender, Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell. On the dangers of stochastic parrots: Can language models be too big? In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, 2021

work page 2021
[20]

Elastic ChatNoir: Search Engine for the ClueWeb and the Common Crawl

Janek Bevendorff, Benno Stein, Matthias Hagen, and Martin Potthast. Elastic ChatNoir: Search Engine for the ClueWeb and the Common Crawl. In European Conference on Information Retrieval Research (ECIR) , 2018. https://github.com/chatnoir-eu/ chatnoir-resiliparse

work page 2018
[21]

FastWARC: Optimizing Large-Scale Web Archive Analytics

Janek Bevendorff, Martin Potthast, and Benno Stein. FastWARC: Optimizing Large-Scale Web Archive Analytics. In International Symposium on Open Search Technology (OSSYM),

work page
[22]

https://github.com/chatnoir-eu/chatnoir-resiliparse

work page
[23]

DeepSeek-AI Xiao Bi, Deli Chen, Guanting Chen, Shanhuang Chen, Damai Dai, Chengqi Deng, Honghui Ding, Kai Dong, Qiushi Du, Zhe Fu, Huazuo Gao, Kaige Gao, Wenjun Gao, Ruiqi Ge, Kang Guan, Daya Guo, Jianzhong Guo, Guangbo Hao, Zhewen Hao, Ying He, Wen-Hui Hu, Panpan Huang, Erhang Li, Guowei Li, Jiashi Li, Yao Li, Y . K. Li, Wenfeng Liang, Fangyun Lin, A. X....

work page internal anchor Pith review Pith/arXiv arXiv 2024
[24]

PIQA: reasoning about physical commonsense in natural language

Yonatan Bisk, Rowan Zellers, Ronan LeBras, Jianfeng Gao, and Yejin Choi. PIQA: reasoning about physical commonsense in natural language. In The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020, The Thirty-Second Innovative Applications of Artificial Intelligence Conference, IAAI 2020, The Tenth AAAI Symposium on Educational Advances in ...

work page 2020
[25]

URL https://aaai.org/ojs/index.php/AAAI/article/ view/6239

AAAI Press, 2020. URL https://aaai.org/ojs/index.php/AAAI/article/ view/6239

work page 2020
[26]

GPT- NeoX-20B: An open-source autoregressive language model

Sidney Black, Stella Biderman, Eric Hallahan, Quentin Anthony, Leo Gao, Laurence Golding, Horace He, Connor Leahy, Kyle McDonell, Jason Phang, Michael Pieler, Usvsn Sai Prashanth, Shivanshu Purohit, Laria Reynolds, Jonathan Tow, Ben Wang, and Samuel Weinbach. GPT- NeoX-20B: An open-source autoregressive language model. In Proceedings of BigScience Episode...

work page 2022
[27]

Burton H. Bloom. Space/time trade-offs in hash coding with allowable errors.Communications of the ACM, 1970. https://doi.org/10.1145/362686.362692

work page doi:10.1145/362686.362692 1970
[28]

Space/time trade-offs in hash coding with allowable errors

Burton H Bloom. Space/time trade-offs in hash coding with allowable errors. Communications of the ACM, 13(7):422–426, 1970

work page 1970
[29]

Nuanced metrics for measuring unintended bias with real data for text classification

Daniel Borkan, Lucas Dixon, Jeffrey Sorensen, Nithum Thain, and Lucy Vasserman. Nuanced metrics for measuring unintended bias with real data for text classification. In Companion proceedings of the 2019 world wide web conference, pp. 491–500, 2019. 14

work page 2019
[30]

Color-filter: Conditional loss reduction filtering for targeted language model pre-training

David Brandfonbrener, Hanlin Zhang, Andreas Kirsch, Jonathan Richard Schwarz, and Sham M Kakade. Color-filter: Conditional loss reduction filtering for targeted language model pre-training. arXiv preprint, 2024

work page 2024
[31]

Andrei Z. Broder. On the resemblance and containment of documents. In Compression and Complexity of Sequences, 1997

work page 1997
[32]

A.Z. Broder. On the resemblance and containment of documents. InProceedings. Compression and Complexity of SEQUENCES 1997 (Cat. No.97TB100171), pp. 21–29, 1997. doi: 10.1109/ SEQUEN.1997.666900

work page arXiv 1997
[33]

Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-V oss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwi...

work page 2020
[34]

IGLUE: A benchmark for transfer learning across modalities, tasks, and languages

Emanuele Bugliarello, Fangyu Liu, Jonas Pfeiffer, Siva Reddy, Desmond Elliott, Edoardo Maria Ponti, and Ivan Vulic. IGLUE: A benchmark for transfer learning across modalities, tasks, and languages. In Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szepesvári, Gang Niu, and Sivan Sabato (eds.),International Conference on Machine Learning, ICML 2022, ...

work page 2022
[35]

Human alignment of large language models through online preference optimisation

Daniele Calandriello, Daniel Guo, Remi Munos, Mark Rowland, Yunhao Tang, Bernardo Avila Pires, Pierre Harvey Richemond, Charline Le Lan, Michal Valko, Tianqi Liu, et al. Human alignment of large language models through online preference optimisation. arXiv preprint arXiv:2403.08635, 2024

[36]

Extracting training data from large language models

Nicholas Carlini, Florian Tramer, Eric Wallace, Matthew Jagielski, Ariel Herbert-Voss, Katherine Lee, Adam Roberts, Tom Brown, Dawn Song, Ulfar Erlingsson, et al. Extracting training data from large language models. In 30th USENIX Security Symposium (USENIX Security 21), pp. 2633–2650, 2021

[37]

Quantifying memorization across neural language models, 2023

Nicholas Carlini, Daphne Ippolito, Matthew Jagielski, Katherine Lee, Florian Tramer, and Chiyuan Zhang. Quantifying memorization across neural language models, 2023

[38]

Data-juicer: A one-stop data processing system for large language models

Daoyuan Chen, Yilun Huang, Zhijian Ma, Hesen Chen, Xuchen Pan, Ce Ge, Dawei Gao, Yuexiang Xie, Zhaoyang Liu, Jinyang Gao, Yaliang Li, Bolin Ding, and Jingren Zhou. Data-juicer: A one-stop data processing system for large language models. In Companion of the 2024 International Conference on Management of Data, SIGMOD/PODS ’24, pp. 120–134, New York, NY, ...

doi: 10.1145/3626246.3653385
[39]

Evaluating Large Language Models Trained on Code

Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde, Jared Kaplan, Harrison Edwards, Yura Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian, Clemens Winte...

[40]

Skill-it! a data-driven skills framework for understanding and training language models

Mayee Chen, Nicholas Roberts, Kush Bhatia, Jue WANG, Ce Zhang, Frederic Sala, and Christopher Ré. Skill-it! a data-driven skills framework for understanding and training language models. In A. Oh, T. Neumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine (eds.), Advances in Neural Information Processing Systems, volume 36, pp. 36000–36040. Curran Assoc...

[41]

PaLM: Scaling Language Modeling with Pathways

Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, Parker Schuh, Kensen Shi, Sasha Tsvyashchenko, Joshua Maynez, Abhishek Rao, Parker Barnes, Yi Tay, Noam M. Shazeer, Vinodkumar Prabhakaran, Emily Reif, Nan Du, Benton C. Hutchinson, Reiner Pope, Ja...

[42]

Scaling Instruction-Finetuned Language Models

Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Yunxuan Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, et al. Scaling instruction-finetuned language models. ArXiv preprint, abs/2210.11416, 2022. URL https://arxiv.org/abs/2210.11416

[43]

BoolQ: Exploring the surprising difficulty of natural yes/no questions

Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. BoolQ: Exploring the surprising difficulty of natural yes/no questions. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers),...

doi: 10.18653/v1/n19-1300
[44]

Elizabeth Clark, Tal August, Sofia Serrano, Nikita Haduong, Suchin Gururangan, and Noah A. Smith. All that’s ‘human’ is not gold: Evaluating human evaluation of generated text. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Lon...

[45]

Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge

Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? try arc, the ai2 reasoning challenge. ArXiv preprint, abs/1803.05457, 2018. URL https://arxiv.org/abs/1803.05457

[47]

URL https://arxiv.org/abs/2110.14168.

[48]

Common Crawl, 2007

Common Crawl. Common Crawl, 2007. https://commoncrawl.org

[49]

Redpajama: an open dataset for training large language models, 2023

Together Computer. Redpajama: an open dataset for training large language models, 2023. URL https://github.com/togethercomputer/RedPajama-Data

[50]

Cross-lingual language model pretraining

Alexis Conneau and Guillaume Lample. Cross-lingual language model pretraining. In Hanna M. Wallach, Hugo Larochelle, Alina Beygelzimer, Florence d’Alché-Buc, Emily B. Fox, and Roman Garnett (eds.), Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8-14, 2019, Vanc...

[51]

Unicode Standard Annex #29: Unicode Text Segmentation, 2023

The Unicode Consortium. Unicode Standard Annex #29: Unicode Text Segmentation, 2023. URL https://www.unicode.org/reports/tr29/

[52]

Ultrafeedback: Boosting language models with high-quality feedback, 2023

Ganqu Cui, Lifan Yuan, Ning Ding, Guanming Yao, Wei Zhu, Yuan Ni, Guotong Xie, Zhiyuan Liu, and Maosong Sun. Ultrafeedback: Boosting language models with high-quality feedback, 2023

[53]

DC-BENCH: Dataset condensation benchmark

Justin Cui, Ruochen Wang, Si Si, and Cho-Jui Hsieh. DC-BENCH: Dataset condensation benchmark. In Thirty-sixth Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2022. URL https://openreview.net/forum?id=Bs8iFQ7AM6

[54]

Scaling vision transformers to 22 billion parameters

Mostafa Dehghani, Josip Djolonga, Basil Mustafa, Piotr Padlewski, Jonathan Heek, Justin Gilmer, Andreas Peter Steiner, Mathilde Caron, Robert Geirhos, Ibrahim Alabdulmohsin, et al. Scaling vision transformers to 22 billion parameters. In International Conference on Machine Learning (ICML), 2023. https://proceedings.mlr.press/v202/dehghani23a.html

[55]

Documenting large webtext corpora: A case study on the colossal clean crawled corpus

Jesse Dodge, Maarten Sap, Ana Marasović, William Agnew, Gabriel Ilharco, Dirk Groeneveld, Margaret Mitchell, and Matt Gardner. Documenting large webtext corpora: A case study on the colossal clean crawled corpus. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp. 1286–1305, Online and Punta Cana, Dominican Republic, 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.emnlp-main.98. URL https://aclanthology.org/2021.emnlp-main.98
[57]

Nan Du, Yanping Huang, Andrew M. Dai, Simon Tong, Dmitry Lepikhin, Yuanzhong Xu, Maxim Krikun, Yanqi Zhou, Adams Wei Yu, Orhan Firat, Barret Zoph, Liam Fedus, Maarten P. Bosma, Zongwei Zhou, Tao Wang, Yu Emma Wang, Kellie Webster, Marie Pellat, Kevin Robinson, Kathleen S. Meier-Hellstern, Toju Duke, Lucas Dixon, Kun Zhang, Quoc V. Le, Yonghui Wu, Zhifeng...

[58]

Length-Controlled AlpacaEval: A Simple Way to Debias Automatic Evaluators

Yann Dubois, Balázs Galambosi, Percy Liang, and Tatsunori B Hashimoto. Length-controlled alpacaeval: A simple way to debias automatic evaluators. arXiv preprint arXiv:2404.04475, 2024

[60]

URL https://arxiv.org/abs/2310.20707

[61]

KTO: Model Alignment as Prospect Theoretic Optimization

Kawin Ethayarajh, Winnie Xu, Niklas Muennighoff, Dan Jurafsky, and Douwe Kiela. Kto: Model alignment as prospect theoretic optimization. ArXiv preprint, abs/2402.01306, 2024. URL https://arxiv.org/abs/2402.01306.

[62]

What’s going on with the open llm leaderboard?

Hugging Face. What’s going on with the open llm leaderboard? https://huggingface.co/blog/open-llm-leaderboard-mmlu, 2023

[63]

Doge: Domain reweighting with generalization estimation

Simin Fan, Matteo Pagliardini, and Martin Jaggi. Doge: Domain reweighting with generalization estimation. ArXiv preprint, abs/2310.15393, 2023. URL https://arxiv.org/abs/2310.15393

[64]

Data filtering networks

Alex Fang, Albin Madappally Jose, Amit Jain, Ludwig Schmidt, Alexander Toshev, and Vaishaal Shankar. Data filtering networks. ArXiv preprint, abs/2309.17425, 2023. URL https://arxiv.org/abs/2309.17425

[65]

Lighteval: A lightweight framework for llm evaluation, 2023

Clémentine Fourrier, Nathan Habib, Thomas Wolf, and Lewis Tunstall. Lighteval: A lightweight framework for llm evaluation, 2023. URL https://github.com/huggingface/lighteval

[66]

Datacomp: In search of the next generation of multimodal datasets

Samir Yitzhak Gadre, Gabriel Ilharco, Alex Fang, Jonathan Hayase, Georgios Smyrnis, Thao Nguyen, Ryan Marten, Mitchell Wortsman, Dhruba Ghosh, Jieyu Zhang, et al. Datacomp: In search of the next generation of multimodal datasets. In Advances in Neural Information Processing Systems (NeurIPS), volume 36, 2024. https://arxiv.org/abs/2304.14108

[67]

Language models scale reliably with over-training and on downstream tasks

Samir Yitzhak Gadre, Georgios Smyrnis, Vaishaal Shankar, Suchin Gururangan, Mitchell Wortsman, Rulin Shao, Jean Mercat, Alex Fang, Jeffrey Li, Sedrick Keh, Rui Xin, Marianna Nezhurina, Igor Vasiljevic, Jenia Jitsev, Alexandros G. Dimakis, Gabriel Ilharco, Shuran Song, Thomas Kollar, Yair Carmon, Achal Dave, Reinhard Heckel, Niklas Muennighoff, and Ludwig ...

[69]

URL https://arxiv.org/abs/2101.00027

[70]

Data mixing made efficient: A bivariate scaling law for language model pretraining

Ce Ge, Zhijian Ma, Daoyuan Chen, Yaliang Li, and Bolin Ding. Data mixing made efficient: A bivariate scaling law for language model pretraining. ArXiv preprint, abs/2405.14908, 2024. URL https://arxiv.org/abs/2405.14908

[71]

Realtoxicityprompts: Evaluating neural toxic degeneration in language models

Samuel Gehman, Suchin Gururangan, Maarten Sap, Yejin Choi, and Noah A Smith. Realtoxicityprompts: Evaluating neural toxic degeneration in language models. Findings of the Association for Computational Linguistics: EMNLP 2020, 2020

[72]

Are we modeling the task or the annotator? an investigation of annotator bias in natural language understanding datasets

Mor Geva, Yoav Goldberg, and Jonathan Berant. Are we modeling the task or the annotator? an investigation of annotator bias in natural language understanding datasets. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 1161...

doi: 10.18653/v1/d19-1107
[73]

Did aristotle use a laptop? a question answering benchmark with implicit reasoning strategies

Mor Geva, Daniel Khashabi, Elad Segal, Tushar Khot, Dan Roth, and Jonathan Berant. Did aristotle use a laptop? a question answering benchmark with implicit reasoning strategies. Transactions of the Association for Computational Linguistics, 9:346–361, 2021. doi: 10.1162/tacl_a_00370. URL https://aclanthology.org/2021.tacl-1.21

[74]

Non-expert evaluation of summarization systems is risky

Dan Gillick and Yang Liu. Non-expert evaluation of summarization systems is risky. In Proceedings of the NAACL HLT 2010 Workshop on Creating Speech and Language Data with Amazon’s Mechanical Turk, pp. 148–151, Los Angeles, 2010. Association for Computational Linguistics. URL https://aclanthology.org/W10-0722.

[75]

Zamba: A compact 7b ssm hybrid model

Paolo Glorioso, Quentin Anthony, Yury Tokpanov, James Whittington, Jonathan Pilault, Adam Ibrahim, and Beren Millidge. Zamba: A compact 7b ssm hybrid model. ArXiv preprint, abs/2405.16712, 2024. URL https://arxiv.org/abs/2405.16712

[76]

Openwebtext corpus, 2019

Aaron Gokaslan, Vanya Cohen, Ellie Pavlick, and Stefanie Tellex. Openwebtext corpus, 2019. http://Skylion007.github.io/OpenWebTextCorpus

[77]

Learning word vectors for 157 languages

Edouard Grave, Piotr Bojanowski, Prakhar Gupta, Armand Joulin, and Tomas Mikolov. Learning word vectors for 157 languages. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan, 2018. European Language Resources Association (ELRA). URL https://aclanthology.org/L18-1550

[78]

The big friendly filter

Dirk Groeneveld. The big friendly filter. https://github.com/allenai/bff, 2023

[79]

OLMo: Accelerating the Science of Language Models

Dirk Groeneveld, Iz Beltagy, Pete Walsh, Akshita Bhagia, Rodney Kinney, Oyvind Tafjord, Ananya Harsh Jha, Hamish Ivison, Ian Magnusson, Yizhong Wang, et al. Olmo: Accelerating the science of language models. ArXiv preprint, abs/2402.00838, 2024. URL https://arxiv.org/abs/2402.00838

[80]

Mamba: Linear-Time Sequence Modeling with Selective State Spaces

Albert Gu and Tri Dao. Mamba: Linear-time sequence modeling with selective state spaces. ArXiv preprint, abs/2312.00752, 2023. URL https://arxiv.org/abs/2312.00752

[81]

Textbooks are all you need

Suriya Gunasekar, Yi Zhang, Jyoti Aneja, Caio Cesar, Teodoro Mendes, Allie Del Giorno, Sivakanth Gopi, Mojan Javaheripi, Piero Kauffmann, Gustavo de Rosa, Olli Saarikivi, Adil Salim, Shital Shah, Harkirat Singh Behl, Xin Wang, Sébastien Bubeck, Ronen Eldan, Adam Tauman Kalai, Yin Tat Lee, and Yuanzhi Li. Textbooks are all you need. Preprint, 2023. https:/...

[82]

OpenLM: a minimal but performative language modeling (lm) repository, 2023

Suchin Gururangan, Mitchell Wortsman, Samir Yitzhak Gadre, Achal Dave, Maciej Kilian, Weijia Shi, Jean Mercat, Georgios Smyrnis, Gabriel Ilharco, Matt Jordan, Reinhard Heckel, Alex Dimakis, Ali Farhadi, Vaishaal Shankar, and Ludwig Schmidt. OpenLM: a minimal but performative language modeling (lm) repository, 2023. https://github.com/mlfoundations/open_lm

[84]

Measuring massive multitask language understanding

Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenReview.net, 2021. URL https://openreview.net/forum?id=d7KBjmI3GmQ

