pith. machine review for the scientific record.

arxiv: 2406.11794 · v4 · pith:W52A4EPYnew · submitted 2024-06-17 · 💻 cs.LG · cs.CL

DataComp-LM: In search of the next generation of training sets for language models

Pith reviewed 2026-05-17 22:53 UTC · model grok-4.3

classification 💻 cs.LG cs.CL
keywords language model pretraining · data curation · model-based filtering · Common Crawl · dataset benchmark · open language models · training data quality

The pith

Model-based filtering of web text produces training sets that let a 7B language model reach 64% 5-shot MMLU accuracy with 2.6T training tokens and 40% less compute than MAP-Neo, the previous best open-data model.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces DataComp for Language Models, a controlled testbed with a 240T-token Common Crawl corpus, standard training recipes, and 53 downstream evaluations. Experiments across model sizes from 412M to 7B parameters show that filtering data with a smaller model to retain only high-quality documents is the decisive curation step. The resulting DCLM-Baseline dataset trains a 7B model from scratch to 64% 5-shot MMLU accuracy. This beats the previous open-data leader by 6.6 points while using 40% less compute, and it performs similarly to several open-weight 7-8B models trained on undisclosed data on the average of the 53 tasks.

Core claim

Model-based filtering is the key mechanism for assembling high-quality pretraining data. Applied to a large Common Crawl extract, it yields DCLM-Baseline, which supports training a 7B language model to 64% 5-shot accuracy on MMLU using 2.6T tokens. The same model improves 6.6 percentage points over MAP-Neo on MMLU, performs comparably to Mistral-7B-v0.3 and Llama 3 8B on that benchmark (63% and 66%), and performs similarly on the average of 53 natural language understanding tasks while requiring 6.6 times less compute than Llama 3 8B.
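The compute ratios quoted here can be sanity-checked with the widely used rule of thumb that training a dense transformer costs roughly 6·N·D FLOPs, where N is the parameter count and D the number of training tokens. The sketch below is a back-of-the-envelope check, not the paper's own accounting. Only the 2.6T-token figure comes from the text; the token counts for Llama 3 8B (about 15T) and MAP-Neo (about 4.5T) are assumptions taken from those models' public descriptions, and nominal parameter counts of 7B and 8B are used.

```python
def train_flops(params: float, tokens: float) -> float:
    """Approximate training cost with the common 6 * N * D rule of thumb."""
    return 6.0 * params * tokens


# 2.6T tokens is stated in the text; 7e9 is the nominal parameter count.
dclm = train_flops(7e9, 2.6e12)
# ASSUMPTION: Llama 3 8B was trained on roughly 15T tokens.
llama3 = train_flops(8e9, 15e12)
# ASSUMPTION: MAP-Neo 7B was trained on roughly 4.5T tokens.
map_neo = train_flops(7e9, 4.5e12)

ratio_vs_llama3 = llama3 / dclm            # how many times more compute Llama 3 8B used
saving_vs_map_neo = 1.0 - dclm / map_neo   # fraction of MAP-Neo compute saved

print(f"Llama 3 8B / DCLM-7B compute ratio: {ratio_vs_llama3:.2f}x")
print(f"DCLM-7B saving relative to MAP-Neo: {saving_vs_map_neo:.0%}")
```

Under these assumptions the ratio lands close to the quoted 6.6x and the saving slightly above 40%, which suggests the reported figures are consistent with a FLOPs-based notion of compute.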

What carries the argument

Model-based filtering, which trains a smaller classifier on high-quality seed data and then scores and retains only the top documents from the full corpus.
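The train, score, and retain loop described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's pipeline: a tiny word-level naive Bayes scorer stands in for whatever classifier the authors use, and the seed texts, the corpus, the function names, and the `keep_fraction` value are all hypothetical.

```python
import math
from collections import Counter


def tokenize(text: str) -> list[str]:
    return text.lower().split()


def train_scorer(positive: list[str], negative: list[str]):
    """Fit per-word log-odds with add-one smoothing (a naive Bayes stand-in)."""
    pos = Counter(w for doc in positive for w in tokenize(doc))
    neg = Counter(w for doc in negative for w in tokenize(doc))
    vocab = set(pos) | set(neg)
    pos_total = sum(pos.values()) + len(vocab)
    neg_total = sum(neg.values()) + len(vocab)
    log_odds = {
        w: math.log((pos[w] + 1) / pos_total) - math.log((neg[w] + 1) / neg_total)
        for w in vocab
    }

    def score(doc: str) -> float:
        words = tokenize(doc)
        # Average the log-odds so long documents are not favoured;
        # words unseen during training contribute 0.
        return sum(log_odds.get(w, 0.0) for w in words) / max(len(words), 1)

    return score


def filter_top_fraction(corpus: list[str], score, keep_fraction: float) -> list[str]:
    """Rank documents by score and keep the highest-scoring fraction."""
    ranked = sorted(corpus, key=score, reverse=True)
    return ranked[: max(1, int(len(ranked) * keep_fraction))]


# Hypothetical toy data, invented for illustration only.
seed_good = ["the theorem follows from the lemma by induction",
             "we measure accuracy on a held out test set"]
seed_bad = ["click here to win free prizes now",
            "buy cheap pills click now free shipping"]
corpus = ["the proof uses induction on the length of the list",
          "free prizes click here now",
          "we report accuracy on the test set",
          "cheap pills buy now"]

score = train_scorer(seed_good, seed_bad)
kept = filter_top_fraction(corpus, score, keep_fraction=0.5)
print(kept)
```

The design point the sketch tries to capture is that the threshold is a retained fraction of the ranked corpus rather than an absolute score, so the filter's strictness can be tuned independently of how the classifier's scores are calibrated.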

If this is right

  • Systematic comparison of data strategies becomes possible at multiple scales using the shared corpus and evaluation suite.
  • High-quality filtered data measurably lowers the compute needed to reach competitive performance on standard benchmarks.
  • Open training sets can now come close to some 7-8B models trained on undisclosed data, both on MMLU and on the broader 53-task average.
  • Further gains from deduplication, mixing ratios, or alternative filters can be measured directly against the same baseline.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Data quality choices may offer a more immediate performance lever than additional scale in the current 1-7B regime.
  • A broad multi-task evaluation suite gives a more stable signal for data experiments than reliance on MMLU alone.
  • The same filtering recipe could be tested on non-English or domain-specific corpora to measure cross-domain transfer.
  • Re-running the baseline with a different filter model family would reveal whether the gains depend on the filter model architecture.

Load-bearing premise

The particular filtering thresholds and the 53-task evaluation suite chosen here will keep producing strong results when the same method is applied at other model scales, to new data sources, or with future architectures.

What would settle it

Train a 7B model on the identical 240T-token corpus but without the model-based filter step and check whether 5-shot MMLU accuracy falls below 58% or requires substantially more than 4T tokens to recover the 64% mark.

Original abstract

We introduce DataComp for Language Models (DCLM), a testbed for controlled dataset experiments with the goal of improving language models. As part of DCLM, we provide a standardized corpus of 240T tokens extracted from Common Crawl, effective pretraining recipes based on the OpenLM framework, and a broad suite of 53 downstream evaluations. Participants in the DCLM benchmark can experiment with data curation strategies such as deduplication, filtering, and data mixing at model scales ranging from 412M to 7B parameters. As a baseline for DCLM, we conduct extensive experiments and find that model-based filtering is key to assembling a high-quality training set. The resulting dataset, DCLM-Baseline enables training a 7B parameter language model from scratch to 64% 5-shot accuracy on MMLU with 2.6T training tokens. Compared to MAP-Neo, the previous state-of-the-art in open-data language models, DCLM-Baseline represents a 6.6 percentage point improvement on MMLU while being trained with 40% less compute. Our baseline model is also comparable to Mistral-7B-v0.3 and Llama 3 8B on MMLU (63% & 66%), and performs similarly on an average of 53 natural language understanding tasks while being trained with 6.6x less compute than Llama 3 8B. Our results highlight the importance of dataset design for training language models and offer a starting point for further research on data curation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 3 minor

Summary. The paper introduces DataComp-LM (DCLM), a testbed and benchmark for controlled experiments on data curation strategies (deduplication, filtering, mixing) for language model pretraining. It releases a standardized 240T-token corpus from Common Crawl, OpenLM-based training recipes, and a suite of 53 downstream evaluations spanning scales from 412M to 7B parameters. The central empirical result is that model-based filtering is the key ingredient for high-quality training sets; the resulting DCLM-Baseline enables a 7B model trained on 2.6T tokens to reach 64% 5-shot MMLU accuracy, a 6.6-point gain over MAP-Neo with 40% less compute, while remaining competitive with Mistral-7B-v0.3 and Llama 3 8B on MMLU and the average of the 53 tasks.

Significance. If the results hold under controlled conditions, the work is significant for establishing a reproducible platform that shifts emphasis toward data-centric methods in LLM training. The multi-scale experiments, broad evaluation suite, and open release of the corpus and recipes provide concrete value for the community and support further research on dataset design. The reported performance deltas illustrate that careful filtering can deliver gains comparable to those from additional compute or scale.

major comments (1)
  1. [§4] §4 and associated experimental tables: The claim that 'model-based filtering is key' and directly produces the 6.6-point MMLU gain requires isolation of the filtering variable. The DCLM-Baseline differs from the MAP-Neo reference in data volume (2.6T tokens), deduplication, and mixing ratios in addition to the filter itself. A controlled contrast that fixes the underlying corpus, total token count, and all other pipeline steps while varying only the filter (heuristic vs. model-based) is needed to support causal attribution of the gains.
minor comments (3)
  1. The precise model-based filtering thresholds, classifier details, and exclusion rules are referenced but not fully specified in the main text; including pseudocode or a dedicated appendix subsection would improve reproducibility.
  2. [Experimental tables] Performance tables would benefit from reporting variance or results across multiple random seeds to allow assessment of the robustness of the reported improvements.
  3. [Abstract] The abstract states '40% less compute' without defining the metric (e.g., total FLOPs or wall-clock GPU hours); adding this clarification would aid direct comparisons.
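The seed-variance request in minor comment 2 could be met with a summary like the following. This is a sketch only: the per-seed scores are placeholder numbers invented for illustration and are not results from the paper, and the function names are hypothetical. It reports mean and sample standard deviation per condition and compares the gap between conditions with its standard error, in the style of a Welch t-statistic.

```python
import math
import statistics


def summarize(scores: list[float]) -> tuple[float, float]:
    """Return (mean, sample standard deviation) over per-seed scores."""
    return statistics.mean(scores), statistics.stdev(scores)


def welch_t(a: list[float], b: list[float]) -> float:
    """Gap between means divided by its standard error (Welch t-statistic)."""
    mean_a, std_a = summarize(a)
    mean_b, std_b = summarize(b)
    standard_error = math.sqrt(std_a**2 / len(a) + std_b**2 / len(b))
    return (mean_b - mean_a) / standard_error


# PLACEHOLDER scores, one per random seed; NOT taken from the paper.
heuristic_filter = [0.300, 0.310, 0.305]
model_filter = [0.340, 0.350, 0.345]

for name, scores in [("heuristic", heuristic_filter), ("model-based", model_filter)]:
    mean, std = summarize(scores)
    print(f"{name:12s} {mean:.3f} ± {std:.3f} (n={len(scores)})")

t_stat = welch_t(heuristic_filter, model_filter)
print(f"Welch t-statistic for the gap: {t_stat:.2f}")
```

A gap that is several standard errors wide is unlikely to be seed noise; with only a handful of seeds per condition the statistic should be read as a rough indication rather than a formal test.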

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive review and for highlighting the value of DataComp-LM as a reproducible testbed. We address the major comment below and will revise the manuscript to clarify the scope of our claims and the controlled nature of our ablations.

Point-by-point responses
  1. Referee: [§4] §4 and associated experimental tables: The claim that 'model-based filtering is key' and directly produces the 6.6-point MMLU gain requires isolation of the filtering variable. The DCLM-Baseline differs from the MAP-Neo reference in data volume (2.6T tokens), deduplication, and mixing ratios in addition to the filter itself. A controlled contrast that fixes the underlying corpus, total token count, and all other pipeline steps while varying only the filter (heuristic vs. model-based) is needed to support causal attribution of the gains.

    Authors: We agree that a fully isolated comparison strengthens causal attribution and appreciate the referee pointing this out. Our manuscript already reports controlled experiments in §4 that fix the underlying DCLM corpus, total token count, deduplication, and mixing ratios while varying only the filtering strategy (heuristic vs. model-based). These ablations demonstrate that model-based filtering is the key driver of performance gains within our testbed. The DCLM-Baseline vs. MAP-Neo comparison is presented as an end-to-end benchmark of our full pipeline against prior open-data work rather than an isolated ablation of the filter. We will revise §4 to explicitly distinguish the internal controlled contrasts from the external benchmark comparison and to qualify that the 6.6-point MMLU gain reflects the cumulative pipeline (including but not limited to model-based filtering).

    Revision: yes
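The single-variable contrast discussed in this exchange can be made explicit as a configuration check. The sketch below is illustrative only: `RunConfig`, its field names, and all values are hypothetical and do not come from the paper or its codebase. It builds two runs from one base configuration and verifies that they differ in the filter and nothing else.

```python
from dataclasses import asdict, dataclass, replace


@dataclass(frozen=True)
class RunConfig:
    # All field names and values are hypothetical placeholders.
    corpus: str
    token_budget: int
    dedup: str
    mixing: str
    filter_name: str


def differing_fields(a: RunConfig, b: RunConfig) -> set[str]:
    """Names of the fields whose values differ between two run configs."""
    da, db = asdict(a), asdict(b)
    return {key for key in da if da[key] != db[key]}


base = RunConfig(
    corpus="shared-pool",
    token_budget=1_000_000_000,
    dedup="same-dedup-step",
    mixing="same-mixing-ratios",
    filter_name="heuristic",
)
# Derive the second arm from the first so every other setting is inherited.
model_based = replace(base, filter_name="model_based")

changed = differing_fields(base, model_based)
print(changed)
```

Deriving each arm from a shared base with `dataclasses.replace` makes it hard to change a second variable by accident, which is the property the referee asks the ablation to guarantee.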

Circularity Check

0 steps flagged

No circularity in empirical dataset curation results

Full rationale

The paper introduces an empirical benchmark (DCLM) and reports performance numbers from training runs on curated data subsets. Central claims rest on direct measurements of downstream accuracy (e.g., MMLU scores) using held-out evaluations rather than any derived quantities, fitted parameters renamed as predictions, or self-citation chains that substitute for independent justification. No equations, uniqueness theorems, or ansatzes are invoked whose validity reduces to the paper's own inputs by construction; results are therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The work is primarily empirical and relies on standard assumptions in language model pretraining such as the validity of next-token prediction and the representativeness of the chosen downstream tasks. No new theoretical axioms or invented entities are introduced in the abstract.

axioms (1)
  • Domain assumption: Next-token prediction on filtered web text produces useful general capabilities.
    Implicit in the choice to pretrain on the curated Common Crawl corpus and evaluate on MMLU and other NLU tasks.



Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?

  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 17 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Scratchpad Patching: Decoupling Compute from Patch Size in Byte-Level Language Models

    cs.CL 2026-05 conditional novelty 7.0

    Scratchpad Patching decouples compute from patch size in byte-level language models by inserting entropy-triggered scratchpads to update patch context dynamically.

  2. Projection-Free Transformers via Gaussian Kernel Attention

    cs.LG 2026-05 unverdicted novelty 7.0

    Gaussian Kernel Attention replaces learned QKV projections with a Gaussian RBF kernel on per-head token features, using 0.42x parameters and 0.49x FLOPs while showing competitive language modeling performance at depth 20.

  3. Expert Upcycling: Shifting the Compute-Efficient Frontier of Mixture-of-Experts

    cs.LG 2026-04 unverdicted novelty 7.0

    Expert upcycling duplicates experts in an existing MoE checkpoint and continues pre-training to match fixed-size baseline performance with 32% less compute.

  4. Synthetic Pre-Pre-Training Improves Language Model Robustness to Noisy Pre-Training Data

    cs.CL 2026-05 unverdicted novelty 6.0

    Synthetic pre-pre-training on structured data improves LLM robustness to noisy pre-training, matching baseline loss with up to 49% fewer natural tokens for a 1B model.

  5. ZAYA1-8B Technical Report

    cs.AI 2026-05 unverdicted novelty 6.0

    ZAYA1-8B is a reasoning MoE model with 700M active parameters that matches larger models on math and coding benchmarks and reaches 91.9% on AIME'25 via Markovian RSA test-time compute.

  6. Bridging Generation and Training: A Systematic Review of Quality Issues in LLMs for Code

    cs.SE 2026-05 accept novelty 6.0

    A review of 114 studies creates taxonomies for code and data quality issues, formalizes 18 propagation mechanisms from training data defects to LLM-generated code defects, and synthesizes detection and mitigation techniques.

  7. NH-CROP: Robust Pricing for Governed Language Data Assets under Cost Uncertainty

    cs.AI 2026-05 unverdicted novelty 6.0

    NH-CROP introduces a robust online pricing method for governed language data with uncertain costs, using a selective verification gate that improves or matches baselines without relying heavily on paid information acq...

  8. Compute Optimal Tokenization

    cs.CL 2026-05 unverdicted novelty 6.0

    Compute-optimal language models require parameter count to scale with data bytes rather than tokens, with optimal token compression rate decreasing as compute budget grows.

  9. Beyond N-gram: Data-Aware X-GRAM Extraction for Efficient Embedding Parameter Scaling

    cs.CL 2026-04 unverdicted novelty 6.0

    X-GRAM applies data-aware dynamic token injection with hybrid hashing and local feature extraction to achieve up to 4.4 accuracy point gains over vanilla backbones and 3.2 over retrieval baselines at 0.73B-1.15B scale...

  10. Expert Upcycling: Shifting the Compute-Efficient Frontier of Mixture-of-Experts

    cs.LG 2026-04 unverdicted novelty 6.0

    Expert upcycling expands MoE models by duplicating experts and continuing pre-training, matching baseline performance while saving 32% GPU hours in 7B-13B experiments.

  11. Parcae: Scaling Laws For Stable Looped Language Models

    cs.LG 2026-04 unverdicted novelty 6.0

    Parcae stabilizes looped LLMs via spectral norm constraints on injection parameters, enabling power-law scaling for training FLOPs and saturating exponential scaling at test time that improves quality over fixed-depth...

  12. Attention Editing: A Versatile Framework for Cross-Architecture Attention Conversion

    cs.CL 2026-04 conditional novelty 6.0

    Attention Editing converts pre-trained LLMs to new attention architectures through layer-wise teacher-forced optimization and model-level distillation, preserving performance with efficiency gains.

  13. CoFrGeNet: Continued Fraction Architectures for Language Generation

    cs.CL 2026-01 unverdicted novelty 6.0

    CoFrGeNet uses continued-fraction function classes to build transformer replacements that match or beat GPT-2 and Llama performance with half to two-thirds the parameters.

  14. Muon is Scalable for LLM Training

    cs.LG 2025-02 unverdicted novelty 6.0

    Muon optimizer with weight decay and update scaling achieves ~2x efficiency over AdamW for large LLMs, shown via the Moonlight 3B/16B MoE model trained on 5.7T tokens.

  15. Matrix: Peer-to-Peer Multi-Agent Synthetic Data Generation Framework

    cs.CL 2025-11 unverdicted novelty 5.0

    Matrix provides a peer-to-peer multi-agent system for synthetic data generation that scales to tens of thousands of workflows and delivers 2-15x higher throughput than centralized designs without quality loss.

  16. SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model

    cs.CL 2025-02 unverdicted novelty 5.0

    SmolLM2 is a 1.7B-parameter language model that outperforms Qwen2.5-1.5B and Llama3.2-1B after overtraining on 11 trillion tokens using custom FineMath, Stack-Edu, and SmolTalk datasets in a multi-stage pipeline.

  17. XekRung Technical Report

    cs.CR 2026-04 unverdicted novelty 3.0

    XekRung achieves state-of-the-art performance on cybersecurity benchmarks among same-scale models via tailored data synthesis and multi-stage training while retaining strong general capabilities.

Reference graph

Works this paper leans on

252 extracted references · 252 canonical work pages · cited by 16 Pith papers · 42 internal anchors

  1. [1]

    SemDeDup: Data-efficient learning at web-scale through semantic deduplication

    Amro Abbas, Kushal Tirumala, Dániel Simig, Surya Ganguli, and Ari S Morcos. Semdedup: Data-efficient learning at web-scale through semantic deduplication, 2023. URL https: //arxiv.org/abs/2303.09540

  2. [2]

    Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone

    Marah Abdin, Sam Ade Jacobs, Ammar Ahmad Awan, Jyoti Aneja, Ahmed Awadallah, Hany Awadalla, Nguyen Bach, Amit Bahree, Arash Bakhtiari, Harkirat Behl, et al. Phi-3 technical report: A highly capable language model locally on your phone. ArXiv preprint, abs/2404.14219, 2024. URL https://arxiv.org/abs/2404.14219

  3. [3]

    Leela, Krishna Prasad Chitrapura, Sachin Garg, Pavan Kumar GM, Chittaranjan Haty, Anirban Roy, and Amit Sasturkar

    Amit Agarwal, Hema Swetha Koppula, Krishna P. Leela, Krishna Prasad Chitrapura, Sachin Garg, Pavan Kumar GM, Chittaranjan Haty, Anirban Roy, and Amit Sasturkar. Url normalization for de-duplication of web pages. In ACM Conference on Information and Knowledge Management, 2009. https://doi.org/10.1145/1645953.1646283

  4. [4]

    Introducing meta llama 3: The most capable openly available llm to date, 2024

    Meta AI. Introducing meta llama 3: The most capable openly available llm to date, 2024. https://ai.meta.com/blog/meta-llama-3/

  5. [5]

    FETA: A benchmark for few-sample task transfer in open-domain dialogue

    Alon Albalak, Yi-Lin Tuan, Pegah Jandaghi, Connor Pryor, Luke Yoffe, Deepak Ramachandran, Lise Getoor, Jay Pujara, and William Yang Wang. FETA: A benchmark for few-sample task transfer in open-domain dialogue. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing , pp. 10936–10953, Abu Dhabi, United Arab Emirates, 2022....

  6. [6]

    Efficient online data mixing for language model pre-training

    Alon Albalak, Liangming Pan, Colin Raffel, and William Yang Wang. Efficient online data mixing for language model pre-training. ArXiv preprint, abs/2312.02406, 2023. URL https://arxiv.org/abs/2312.02406

  7. [7]

    Improving few-shot generalization by exploring and exploiting auxiliary data

    Alon Albalak, Colin Raffel, and William Yang Wang. Improving few-shot generalization by exploring and exploiting auxiliary data. In Advances in Neural Information Processing Systems (NeurIPS), 2023. https://openreview.net/forum?id=JDnLXc4NOn

  8. [8]

    A survey on 12 data selection for language models

    Alon Albalak, Yanai Elazar, Sang Michael Xie, Shayne Longpre, Nathan Lambert, Xinyi Wang, Niklas Muennighoff, Bairu Hou, Liangming Pan, Haewon Jeong, et al. A survey on 12 data selection for language models. ArXiv preprint, abs/2402.16827, 2024. URL https: //arxiv.org/abs/2402.16827

  9. [9]

    Santacoder: Don’t reach for the stars!

    Loubna Ben Allal, Raymond Li, Denis Kocetkov, Chenghao Mou, Christopher Akiki, Carlos Munoz Ferrandis, Niklas Muennighoff, Mayank Mishra, Alex Gu, Manan Dey, et al. Santacoder: don’t reach for the stars! ArXiv preprint, abs/2301.03988, 2023. URL https://arxiv.org/abs/2301.03988

  10. [10]

    The Falcon Series of Open Language Models

    Ebtesam Almazrouei, Hamza Alobeidli, Abdulaziz Alshamsi, Alessandro Cappelli, Ruxandra- Aimée Cojocaru, Daniel Hesslow, Julien Launay, Quentin Malartic, Daniele Mazzotta, Badreddine Noune, Baptiste Pannier, and Guilherme Penedo. The falcon series of open language models. arXiv preprint arXiv:2311.16867, 2023

  11. [11]

    M ath QA : Towards interpretable math word problem solving with operation-based formalisms

    Aida Amini, Saadia Gabriel, Shanchuan Lin, Rik Koncel-Kedziorski, Yejin Choi, and Hannaneh Hajishirzi. MathQA: Towards interpretable math word problem solving with operation-based formalisms. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and...

  12. [12]

    Leavitt, and Mansheej Paul

    Zachary Ankner, Cody Blakeney, Kartik Sreenivasan, Max Marion, Matthew L. Leavitt, and Mansheej Paul. Perplexed by perplexity: Perplexity-based data pruning with small reference models. ArXiv preprint, abs/2405.20541, 2024. URL https://arxiv.org/abs/2405. 20541

  13. [13]

    Pytorch 2: Faster machine learning through dynamic python bytecode transformation and graph compilation

    Jason Ansel, Edward Yang, Horace He, Natalia Gimelshein, Animesh Jain, Michael V oznesensky, Bin Bao, David Berard, Geeta Chauhan, Anjali Chourdia, Will Constable, Alban Desmaison, Zachary DeVito, Elias Ellison, Will Feng, Jiong Gong, Michael Gschwind, Brian Hirsh, Sherlock Huang, Laurent Kirsch, Michael Lazos, Yanbo Liang, Jason Liang, Yinghai Lu, CK Luk...

  14. [14]

    Llemma: An open language model for mathematics

    Zhangir Azerbayev, Hailey Schoelkopf, Keiran Paster, Marco Dos Santos, Stephen McAleer, Albert Q Jiang, Jia Deng, Stella Biderman, and Sean Welleck. Llemma: An open language model for mathematics. ArXiv preprint, abs/2310.10631, 2023. URL https://arxiv.org/ abs/2310.10631

  15. [15]

    Layer Normalization

    Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization.ArXiv preprint, abs/1607.06450, 2016. URL https://arxiv.org/abs/1607.06450

  16. [16]

    Comparing bad apples to good oranges: Aligning large language models via joint preference optimization

    Hritik Bansal, Ashima Suvarna, Gantavya Bhatt, Nanyun Peng, Kai-Wei Chang, and Aditya Grover. Comparing bad apples to good oranges: Aligning large language models via joint preference optimization. arXiv preprint arXiv:2404.00530, 2024

  17. [17]

    Trafilatura: A web scraping library and command-line tool for text discovery and extraction

    Adrien Barbaresi. Trafilatura: A web scraping library and command-line tool for text discovery and extraction. InProceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing: System Demonstrations, pp. 122–131, Online, 2021. Association for Computational ...

  18. [18]

    Beyond the imitation game: Quantifying and extrapolating the capabilities of language models

    BIG bench authors. Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. In Transactions on Machine Learning Research (TMLR), 2023. https: //openreview.net/forum?id=uyTL5Bvosj

  19. [19]

    Bender, Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell

    Emily M. Bender, Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell. On the dangers of stochastic parrots: Can language models be too big? In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, 2021

  20. [20]

    Elastic ChatNoir: Search Engine for the ClueWeb and the Common Crawl

    Janek Bevendorff, Benno Stein, Matthias Hagen, and Martin Potthast. Elastic ChatNoir: Search Engine for the ClueWeb and the Common Crawl. In European Conference on Information Retrieval Research (ECIR) , 2018. https://github.com/chatnoir-eu/ chatnoir-resiliparse

  21. [21]

    FastWARC: Optimizing Large-Scale Web Archive Analytics

    Janek Bevendorff, Martin Potthast, and Benno Stein. FastWARC: Optimizing Large-Scale Web Archive Analytics. In International Symposium on Open Search Technology (OSSYM),

  22. [22]

    https://github.com/chatnoir-eu/chatnoir-resiliparse

  23. [23]

    DeepSeek-AI Xiao Bi, Deli Chen, Guanting Chen, Shanhuang Chen, Damai Dai, Chengqi Deng, Honghui Ding, Kai Dong, Qiushi Du, Zhe Fu, Huazuo Gao, Kaige Gao, Wenjun Gao, Ruiqi Ge, Kang Guan, Daya Guo, Jianzhong Guo, Guangbo Hao, Zhewen Hao, Ying He, Wen-Hui Hu, Panpan Huang, Erhang Li, Guowei Li, Jiashi Li, Yao Li, Y . K. Li, Wenfeng Liang, Fangyun Lin, A. X....

  24. [24]

    PIQA: reasoning about physical commonsense in natural language

    Yonatan Bisk, Rowan Zellers, Ronan LeBras, Jianfeng Gao, and Yejin Choi. PIQA: reasoning about physical commonsense in natural language. In The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020, The Thirty-Second Innovative Applications of Artificial Intelligence Conference, IAAI 2020, The Tenth AAAI Symposium on Educational Advances in ...

  25. [25]

    URL https://aaai.org/ojs/index.php/AAAI/article/ view/6239

    AAAI Press, 2020. URL https://aaai.org/ojs/index.php/AAAI/article/ view/6239

  26. [26]

    GPT- NeoX-20B: An open-source autoregressive language model

    Sidney Black, Stella Biderman, Eric Hallahan, Quentin Anthony, Leo Gao, Laurence Golding, Horace He, Connor Leahy, Kyle McDonell, Jason Phang, Michael Pieler, Usvsn Sai Prashanth, Shivanshu Purohit, Laria Reynolds, Jonathan Tow, Ben Wang, and Samuel Weinbach. GPT- NeoX-20B: An open-source autoregressive language model. In Proceedings of BigScience Episode...

  27. [27]

    Burton H. Bloom. Space/time trade-offs in hash coding with allowable errors.Communications of the ACM, 1970. https://doi.org/10.1145/362686.362692

  28. [28]

    Space/time trade-offs in hash coding with allowable errors

    Burton H Bloom. Space/time trade-offs in hash coding with allowable errors. Communications of the ACM, 13(7):422–426, 1970

  29. [29]

    Nuanced metrics for measuring unintended bias with real data for text classification

    Daniel Borkan, Lucas Dixon, Jeffrey Sorensen, Nithum Thain, and Lucy Vasserman. Nuanced metrics for measuring unintended bias with real data for text classification. In Companion proceedings of the 2019 world wide web conference, pp. 491–500, 2019. 14

  30. [30]

    Color-filter: Conditional loss reduction filtering for targeted language model pre-training

    David Brandfonbrener, Hanlin Zhang, Andreas Kirsch, Jonathan Richard Schwarz, and Sham M Kakade. Color-filter: Conditional loss reduction filtering for targeted language model pre-training. arXiv preprint, 2024

  31. [31]

    Andrei Z. Broder. On the resemblance and containment of documents. In Compression and Complexity of Sequences, 1997

  32. [32]

    A.Z. Broder. On the resemblance and containment of documents. InProceedings. Compression and Complexity of SEQUENCES 1997 (Cat. No.97TB100171), pp. 21–29, 1997. doi: 10.1109/ SEQUEN.1997.666900

  33. [33]

    Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-V oss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwi...

  34. [34]

    IGLUE: A benchmark for transfer learning across modalities, tasks, and languages

    Emanuele Bugliarello, Fangyu Liu, Jonas Pfeiffer, Siva Reddy, Desmond Elliott, Edoardo Maria Ponti, and Ivan Vulic. IGLUE: A benchmark for transfer learning across modalities, tasks, and languages. In Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szepesvári, Gang Niu, and Sivan Sabato (eds.),International Conference on Machine Learning, ICML 2022, ...

  35. [35]

    Human alignment of large language models through online preference optimisation

    Daniele Calandriello, Daniel Guo, Remi Munos, Mark Rowland, Yunhao Tang, Bernardo Avila Pires, Pierre Harvey Richemond, Charline Le Lan, Michal Valko, Tianqi Liu, et al. Human alignment of large language models through online preference optimisation. arXiv preprint arXiv:2403.08635, 2024

  36. [36]

    Extracting training data from large language models

    Nicholas Carlini, Florian Tramer, Eric Wallace, Matthew Jagielski, Ariel Herbert-V oss, Katherine Lee, Adam Roberts, Tom Brown, Dawn Song, Ulfar Erlingsson, et al. Extracting training data from large language models. In 30th USENIX Security Symposium (USENIX Security 21), pp. 2633–2650, 2021

  37. [37]

    Quantifying memorization across neural language models, 2023

    Nicholas Carlini, Daphne Ippolito, Matthew Jagielski, Katherine Lee, Florian Tramer, and Chiyuan Zhang. Quantifying memorization across neural language models, 2023

  38. [38]

    Data- juicer: A one-stop data processing system for large language models

    Daoyuan Chen, Yilun Huang, Zhijian Ma, Hesen Chen, Xuchen Pan, Ce Ge, Dawei Gao, Yuexiang Xie, Zhaoyang Liu, Jinyang Gao, Yaliang Li, Bolin Ding, and Jingren Zhou. Data- juicer: A one-stop data processing system for large language models. In Companion of the 2024 International Conference on Management of Data, SIGMOD/PODS ’24, pp. 120–134, New York, NY , ...

  39. [39]

    Evaluating Large Language Models Trained on Code

    Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde, Jared Kaplan, Harrison Edwards, Yura Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian, Clemens Winte...

  40. [40]

    Skill-it! a data-driven skills framework for understanding and training language models

    Mayee Chen, Nicholas Roberts, Kush Bhatia, Jue WANG, Ce Zhang, Frederic Sala, and Christopher Ré. Skill-it! a data-driven skills framework for understanding and training language models. In A. Oh, T. Neumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine (eds.), Advances in Neural Information Processing Systems, volume 36, pp. 36000–36040. Curran Assoc...

  41. [41]

    PaLM: Scaling Language Modeling with Pathways

    Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, Parker Schuh, Kensen Shi, Sasha Tsvyashchenko, Joshua Maynez, Abhishek Rao, Parker Barnes, Yi Tay, Noam M. Shazeer, Vinodkumar Prabhakaran, Emily Reif, Nan Du, Benton C. Hutchinson, Reiner Pope, Ja...

  42. [42]

    Scaling Instruction-Finetuned Language Models

    Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Yunxuan Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, et al. Scaling instruction-finetuned language models. ArXiv preprint, abs/2210.11416, 2022. URL https://arxiv.org/abs/ 2210.11416

  43. [43]

    BoolQ: Exploring the surprising difficulty of natural yes/no questions

    Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. BoolQ: Exploring the surprising difficulty of natural yes/no questions. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers),...

  44. [44]

    Elizabeth Clark, Tal August, Sofia Serrano, Nikita Haduong, Suchin Gururangan, and Noah A. Smith. All that’s ‘human’ is not gold: Evaluating human evaluation of generated text. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Lon...

  45. [45]

    Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge

    Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? Try ARC, the AI2 Reasoning Challenge. ArXiv preprint, abs/1803.05457, 2018. URL https://arxiv.org/abs/1803.05457

  46. [47]

    URL https://arxiv.org/abs/2110.14168

  47. [48]

    Common Crawl, 2007

    Common Crawl. Common Crawl, 2007. https://commoncrawl.org

  48. [49]

    Redpajama: an open dataset for training large language models, 2023

    Together Computer. Redpajama: an open dataset for training large language models, 2023. URL https://github.com/togethercomputer/RedPajama-Data

  49. [50]

    Cross-lingual language model pretraining

    Alexis Conneau and Guillaume Lample. Cross-lingual language model pretraining. In Hanna M. Wallach, Hugo Larochelle, Alina Beygelzimer, Florence d’Alché-Buc, Emily B. Fox, and Roman Garnett (eds.), Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8-14, 2019, Vanc...

  50. [51]

    Unicode Standard Annex #29: Unicode Text Segmentation, 2023

    The Unicode Consortium. Unicode Standard Annex #29: Unicode Text Segmentation, 2023. URL https://www.unicode.org/reports/tr29/

  51. [52]

    Ultrafeedback: Boosting language models with high-quality feedback, 2023

    Ganqu Cui, Lifan Yuan, Ning Ding, Guanming Yao, Wei Zhu, Yuan Ni, Guotong Xie, Zhiyuan Liu, and Maosong Sun. Ultrafeedback: Boosting language models with high-quality feedback, 2023

  52. [53]

    DC-BENCH: Dataset condensation benchmark

    Justin Cui, Ruochen Wang, Si Si, and Cho-Jui Hsieh. DC-BENCH: Dataset condensation benchmark. In Thirty-sixth Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2022. URL https://openreview.net/forum?id=Bs8iFQ7AM6

  53. [54]

    Scaling vision transformers to 22 billion parameters

    Mostafa Dehghani, Josip Djolonga, Basil Mustafa, Piotr Padlewski, Jonathan Heek, Justin Gilmer, Andreas Peter Steiner, Mathilde Caron, Robert Geirhos, Ibrahim Alabdulmohsin, et al. Scaling vision transformers to 22 billion parameters. In International Conference on Machine Learning (ICML), 2023. https://proceedings.mlr.press/v202/dehghani23a.html

  54. [55]

    Documenting large webtext corpora: A case study on the colossal clean crawled corpus

    Jesse Dodge, Maarten Sap, Ana Marasović, William Agnew, Gabriel Ilharco, Dirk Groeneveld, Margaret Mitchell, and Matt Gardner. Documenting large webtext corpora: A case study on the colossal clean crawled corpus. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp. 1286–1305, Online and Punta Cana, Dominican Republic,

  55. [56]

    doi: 10.18653/v1/2021.emnlp-main.98

    Association for Computational Linguistics. doi: 10.18653/v1/2021.emnlp-main.98. URL https://aclanthology.org/2021.emnlp-main.98

  56. [57]

    Dai, Simon Tong, Dmitry Lepikhin, Yuanzhong Xu, Maxim Krikun, Yanqi Zhou, Adams Wei Yu, Orhan Firat, Barret Zoph, Liam Fedus, Maarten P

    Nan Du, Yanping Huang, Andrew M. Dai, Simon Tong, Dmitry Lepikhin, Yuanzhong Xu, Maxim Krikun, Yanqi Zhou, Adams Wei Yu, Orhan Firat, Barret Zoph, Liam Fedus, Maarten P. Bosma, Zongwei Zhou, Tao Wang, Yu Emma Wang, Kellie Webster, Marie Pellat, Kevin Robinson, Kathleen S. Meier-Hellstern, Toju Duke, Lucas Dixon, Kun Zhang, Quoc V. Le, Yonghui Wu, Zhifeng...

  57. [58]

    Length-Controlled AlpacaEval: A Simple Way to Debias Automatic Evaluators

    Yann Dubois, Balázs Galambosi, Percy Liang, and Tatsunori B Hashimoto. Length-controlled alpacaeval: A simple way to debias automatic evaluators. arXiv preprint arXiv:2404.04475, 2024

  58. [60]

    URL https://arxiv.org/abs/2310.20707

  59. [61]

    KTO: Model Alignment as Prospect Theoretic Optimization

    Kawin Ethayarajh, Winnie Xu, Niklas Muennighoff, Dan Jurafsky, and Douwe Kiela. KTO: Model alignment as prospect theoretic optimization. ArXiv preprint, abs/2402.01306, 2024. URL https://arxiv.org/abs/2402.01306

  60. [62]

    What’s going on with the open llm leaderboard?

    Hugging Face. What’s going on with the open llm leaderboard? https://huggingface.co/blog/open-llm-leaderboard-mmlu, 2023

  61. [63]

    Doge: Domain reweighting with generalization estimation

    Simin Fan, Matteo Pagliardini, and Martin Jaggi. Doge: Domain reweighting with generalization estimation. ArXiv preprint, abs/2310.15393, 2023. URL https://arxiv.org/abs/2310.15393

  62. [64]

    Data filtering networks

    Alex Fang, Albin Madappally Jose, Amit Jain, Ludwig Schmidt, Alexander Toshev, and Vaishaal Shankar. Data filtering networks. ArXiv preprint, abs/2309.17425, 2023. URL https://arxiv.org/abs/2309.17425

  63. [65]

    Lighteval: A lightweight framework for llm evaluation, 2023

    Clémentine Fourrier, Nathan Habib, Thomas Wolf, and Lewis Tunstall. Lighteval: A lightweight framework for llm evaluation, 2023. URL https://github.com/huggingface/lighteval

  64. [66]

    Datacomp: In search of the next generation of multimodal datasets

    Samir Yitzhak Gadre, Gabriel Ilharco, Alex Fang, Jonathan Hayase, Georgios Smyrnis, Thao Nguyen, Ryan Marten, Mitchell Wortsman, Dhruba Ghosh, Jieyu Zhang, et al. Datacomp: In search of the next generation of multimodal datasets. In Advances in Neural Information Processing Systems (NeurIPS), volume 36, 2024. https://arxiv.org/abs/2304.14108

  65. [67]

    Language models scale reliably with over-training and on downstream tasks

    Samir Yitzhak Gadre, Georgios Smyrnis, Vaishaal Shankar, Suchin Gururangan, Mitchell Wortsman, Rulin Shao, Jean Mercat, Alex Fang, Jeffrey Li, Sedrick Keh, Rui Xin, Marianna Nezhurina, Igor Vasiljevic, Jenia Jitsev, Alexandros G. Dimakis, Gabriel Ilharco, Shuran Song, Thomas Kollar, Yair Carmon, Achal Dave, Reinhard Heckel, Niklas Muennighoff, and Ludwig ...

  66. [69]

    URL https://arxiv.org/abs/2101.00027

  67. [70]

    Data mixing made efficient: A bivariate scaling law for language model pretraining

    Ce Ge, Zhijian Ma, Daoyuan Chen, Yaliang Li, and Bolin Ding. Data mixing made efficient: A bivariate scaling law for language model pretraining. ArXiv preprint, abs/2405.14908, 2024. URL https://arxiv.org/abs/2405.14908

  68. [71]

    Realtoxicityprompts: Evaluating neural toxic degeneration in language models

    Samuel Gehman, Suchin Gururangan, Maarten Sap, Yejin Choi, and Noah A Smith. Realtoxicityprompts: Evaluating neural toxic degeneration in language models. Findings of the Association for Computational Linguistics: EMNLP 2020, 2020

  69. [72]

    Are we modeling the task or the annotator? an investigation of annotator bias in natural language understanding datasets

    Mor Geva, Yoav Goldberg, and Jonathan Berant. Are we modeling the task or the annotator? an investigation of annotator bias in natural language understanding datasets. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 1161...

  70. [73]

    Did aristotle use a laptop? a question answering benchmark with implicit reasoning strategies

    Mor Geva, Daniel Khashabi, Elad Segal, Tushar Khot, Dan Roth, and Jonathan Berant. Did aristotle use a laptop? a question answering benchmark with implicit reasoning strategies. Transactions of the Association for Computational Linguistics, 9:346–361, 2021. doi: 10.1162/tacl_a_00370. URL https://aclanthology.org/2021.tacl-1.21

  71. [74]

    Non-expert evaluation of summarization systems is risky

    Dan Gillick and Yang Liu. Non-expert evaluation of summarization systems is risky. In Proceedings of the NAACL HLT 2010 Workshop on Creating Speech and Language Data with Amazon’s Mechanical Turk, pp. 148–151, Los Angeles, 2010. Association for Computational Linguistics. URL https://aclanthology.org/W10-0722

  72. [75]

    Zamba: A compact 7b ssm hybrid model

    Paolo Glorioso, Quentin Anthony, Yury Tokpanov, James Whittington, Jonathan Pilault, Adam Ibrahim, and Beren Millidge. Zamba: A compact 7b ssm hybrid model. ArXiv preprint, abs/2405.16712, 2024. URL https://arxiv.org/abs/2405.16712

  73. [76]

    Openwebtext corpus, 2019

    Aaron Gokaslan, Vanya Cohen, Ellie Pavlick, and Stefanie Tellex. Openwebtext corpus, 2019. http://Skylion007.github.io/OpenWebTextCorpus

  74. [77]

    Learning word vectors for 157 languages

    Edouard Grave, Piotr Bojanowski, Prakhar Gupta, Armand Joulin, and Tomas Mikolov. Learning word vectors for 157 languages. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan, 2018. European Language Resources Association (ELRA). URL https://aclanthology.org/L18-1550

  75. [78]

    The big friendly filter

    Dirk Groeneveld. The big friendly filter. https://github.com/allenai/bff, 2023

  76. [79]

    OLMo: Accelerating the Science of Language Models

    Dirk Groeneveld, Iz Beltagy, Pete Walsh, Akshita Bhagia, Rodney Kinney, Oyvind Tafjord, Ananya Harsh Jha, Hamish Ivison, Ian Magnusson, Yizhong Wang, et al. OLMo: Accelerating the science of language models. ArXiv preprint, abs/2402.00838, 2024. URL https://arxiv.org/abs/2402.00838

  77. [80]

    Mamba: Linear-Time Sequence Modeling with Selective State Spaces

    Albert Gu and Tri Dao. Mamba: Linear-time sequence modeling with selective state spaces. ArXiv preprint, abs/2312.00752, 2023. URL https://arxiv.org/abs/2312.00752

  78. [81]

    Textbooks are all you need

    Suriya Gunasekar, Yi Zhang, Jyoti Aneja, Caio Cesar, Teodoro Mendes, Allie Del Giorno, Sivakanth Gopi, Mojan Javaheripi, Piero Kauffmann, Gustavo de Rosa, Olli Saarikivi, Adil Salim, Shital Shah, Harkirat Singh Behl, Xin Wang, Sébastien Bubeck, Ronen Eldan, Adam Tauman Kalai, Yin Tat Lee, and Yuanzhi Li. Textbooks are all you need. Preprint, 2023. https:/...

  79. [82]

    OpenLM: a minimal but performative language modeling (lm) repository, 2023

    Suchin Gururangan, Mitchell Wortsman, Samir Yitzhak Gadre, Achal Dave, Maciej Kilian, Weijia Shi, Jean Mercat, Georgios Smyrnis, Gabriel Ilharco, Matt Jordan, Reinhard Heckel, Alex Dimakis, Ali Farhadi, Vaishaal Shankar, and Ludwig Schmidt. OpenLM: a minimal but performative language modeling (lm) repository, 2023. https://github.com/mlfoundations/open_lm

  80. [84]

    Measuring massive multitask language understanding

    Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenReview.net, 2021. URL https://openreview.net/forum?id=d7KBjmI3GmQ

Showing first 80 references.