DataComp-LM: In search of the next generation of training sets for language models
Pith reviewed 2026-05-17 22:53 UTC · model grok-4.3
The pith
Model-based filtering of web text produces a training set that lets a 7B language model reach 64% 5-shot MMLU accuracy with 2.6T tokens and 40% less compute than MAP-Neo, the previous best open-data model.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Model-based filtering is the key mechanism for assembling high-quality pretraining data. Applied to a large Common Crawl extract, it yields DCLM-Baseline, which supports training a 7B language model to 64% 5-shot accuracy on MMLU using 2.6T tokens. The same model improves on MAP-Neo by 6.6 percentage points on MMLU, is comparable to Mistral-7B-v0.3 and Llama 3 8B on that benchmark (63% and 66%), and performs similarly on the average of 53 natural language understanding tasks while using 6.6 times less compute than Llama 3 8B.
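The compute ratios in the claim can be sanity-checked with the common rule of thumb that training cost is roughly 6 x parameters x tokens. The sketch below is only a back-of-the-envelope check, not the paper's own accounting. The DCLM figures come from the abstract; the token counts for Llama 3 8B and MAP-Neo are assumptions taken from those models' public announcements and do not appear on this page.

```python
# Rough training-compute estimate using the C ~ 6 * N * D rule of thumb
# (N = parameters, D = training tokens). An approximation only.

def train_flops(params: float, tokens: float) -> float:
    """Approximate training FLOPs for a dense transformer."""
    return 6.0 * params * tokens

# DCLM-Baseline figures are taken from the abstract (7B parameters, 2.6T tokens).
dclm = train_flops(7e9, 2.6e12)

# ASSUMPTION: token counts below come from the models' public announcements,
# not from this page.
llama3_8b = train_flops(8e9, 15e12)
map_neo_7b = train_flops(7e9, 4.5e12)

print(f"DCLM-Baseline 7B: {dclm:.2e} FLOPs")
print(f"Llama 3 8B / DCLM-Baseline: {llama3_8b / dclm:.1f}x")
print(f"DCLM-Baseline saving vs MAP-Neo: {1 - dclm / map_neo_7b:.0%}")
```

Under these assumptions the estimate gives about 6.6x for Llama 3 8B and a saving of about 42% relative to MAP-Neo, which is in line with the rounded figures quoted in the abstract.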
What carries the argument
Model-based filtering, which trains a small classifier on high-quality seed data, scores every document in the full corpus with it, and retains only the top-scoring documents.
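A minimal sketch of this train, score, and keep-the-top loop is shown below. It uses a toy unigram log-odds scorer as a stand-in; the classifier, seed data, and threshold used for DCLM-Baseline are not specified on this page, so every name, document, and the `keep_fraction` value here is hypothetical.

```python
import math
from collections import Counter

def tokenize(text: str) -> list[str]:
    return text.lower().split()

def train_scorer(positives: list[str], negatives: list[str]):
    """Fit a unigram log-odds scorer with add-one smoothing (toy stand-in)."""
    pos = Counter(t for d in positives for t in tokenize(d))
    neg = Counter(t for d in negatives for t in tokenize(d))
    vocab = set(pos) | set(neg)
    pos_total = sum(pos.values()) + len(vocab)
    neg_total = sum(neg.values()) + len(vocab)
    weights = {
        t: math.log((pos[t] + 1) / pos_total) - math.log((neg[t] + 1) / neg_total)
        for t in vocab
    }

    def score(doc: str) -> float:
        toks = tokenize(doc)
        # Mean log-odds, so long documents are not favoured; unseen tokens add 0.
        return sum(weights.get(t, 0.0) for t in toks) / max(len(toks), 1)

    return score

def filter_top_fraction(corpus: list[str], score, keep_fraction: float) -> list[str]:
    """Rank documents by score and keep the top fraction."""
    ranked = sorted(corpus, key=score, reverse=True)
    keep = max(1, int(len(ranked) * keep_fraction))
    return ranked[:keep]

# Toy data, invented for illustration only.
seed_good = ["the proof follows from the lemma", "we measure accuracy on the benchmark"]
seed_bad = ["click here to win free prizes", "buy cheap pills click now"]
corpus = [
    "click now for free prizes",
    "the lemma gives accuracy bounds on the benchmark",
    "win cheap prizes here",
    "we measure the proof",
]

score = train_scorer(seed_good, seed_bad)
kept = filter_top_fraction(corpus, score, keep_fraction=0.5)
print(kept)
```

The key design choice is that the filter ranks documents and keeps a fixed fraction rather than applying an absolute score cut-off, so the size of the retained set is controlled directly.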
If this is right
- Systematic comparison of data strategies becomes possible at multiple scales using the shared corpus and evaluation suite.
- High-quality filtered data measurably lowers the compute needed to reach competitive performance on standard benchmarks.
- Models trained on open datasets can now perform comparably to open-weight, closed-data 7-8B models such as Mistral-7B-v0.3 and Llama 3 8B on MMLU and on the broader 53-task average.
- Further gains from deduplication, mixing ratios, or alternative filters can be measured directly against the same baseline.
Where Pith is reading between the lines
- Data quality choices may offer a more immediate performance lever than additional scale in the current 1-7B regime.
- A broad multi-task evaluation suite gives a more stable signal for data experiments than reliance on MMLU alone.
- The same filtering recipe could be tested on non-English or domain-specific corpora to measure cross-domain transfer.
- Re-running the baseline with a different base model family would reveal whether the gains depend on the filter model architecture.
Load-bearing premise
The particular filtering thresholds and the 53-task evaluation suite chosen here will keep producing strong results when the same method is applied at other model scales, to new data sources, or with future architectures.
What would settle it
Train a 7B model on data drawn from the same 240T-token corpus but without the model-based filter step, and check whether 5-shot MMLU accuracy falls below 58% or whether substantially more than 4T tokens are needed to recover the 64% mark.
Original abstract
We introduce DataComp for Language Models (DCLM), a testbed for controlled dataset experiments with the goal of improving language models. As part of DCLM, we provide a standardized corpus of 240T tokens extracted from Common Crawl, effective pretraining recipes based on the OpenLM framework, and a broad suite of 53 downstream evaluations. Participants in the DCLM benchmark can experiment with data curation strategies such as deduplication, filtering, and data mixing at model scales ranging from 412M to 7B parameters. As a baseline for DCLM, we conduct extensive experiments and find that model-based filtering is key to assembling a high-quality training set. The resulting dataset, DCLM-Baseline, enables training a 7B parameter language model from scratch to 64% 5-shot accuracy on MMLU with 2.6T training tokens. Compared to MAP-Neo, the previous state-of-the-art in open-data language models, DCLM-Baseline represents a 6.6 percentage point improvement on MMLU while being trained with 40% less compute. Our baseline model is also comparable to Mistral-7B-v0.3 and Llama 3 8B on MMLU (63% & 66%), and performs similarly on an average of 53 natural language understanding tasks while being trained with 6.6x less compute than Llama 3 8B. Our results highlight the importance of dataset design for training language models and offer a starting point for further research on data curation.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces DataComp-LM (DCLM), a testbed and benchmark for controlled experiments on data curation strategies (deduplication, filtering, mixing) for language model pretraining. It releases a standardized 240T-token corpus from Common Crawl, OpenLM-based training recipes, and a suite of 53 downstream evaluations spanning scales from 412M to 7B parameters. The central empirical result is that model-based filtering is the key ingredient for high-quality training sets; the resulting DCLM-Baseline enables a 7B model trained on 2.6T tokens to reach 64% 5-shot MMLU accuracy, a 6.6-point gain over MAP-Neo with 40% less compute, while remaining competitive with Mistral-7B-v0.3 and Llama 3 8B on MMLU and the average of the 53 tasks.
Significance. If the results hold under controlled conditions, the work is significant for establishing a reproducible platform that shifts emphasis toward data-centric methods in LLM training. The multi-scale experiments, broad evaluation suite, and open release of the corpus and recipes provide concrete value for the community and support further research on dataset design. The reported performance deltas illustrate that careful filtering can deliver gains comparable to those from additional compute or scale.
major comments (1)
- [§4] §4 and associated experimental tables: The claim that 'model-based filtering is key' and directly produces the 6.6-point MMLU gain requires isolation of the filtering variable. The DCLM-Baseline differs from the MAP-Neo reference in data volume (2.6T tokens), deduplication, and mixing ratios in addition to the filter itself. A controlled contrast that fixes the underlying corpus, total token count, and all other pipeline steps while varying only the filter (heuristic vs. model-based) is needed to support causal attribution of the gains.
minor comments (3)
- The precise model-based filtering thresholds, classifier details, and exclusion rules are referenced but not fully specified in the main text; including pseudocode or a dedicated appendix subsection would improve reproducibility.
- [Experimental tables] Performance tables would benefit from reporting variance or results across multiple random seeds to allow assessment of the robustness of the reported improvements.
- [Abstract] The abstract states '40% less compute' without defining the metric (e.g., total FLOPs or wall-clock GPU hours); adding this clarification would aid direct comparisons.
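The robustness check requested in the minor comments (variance across random seeds) could look roughly like the sketch below. It compares the mean gap between two filtering strategies with the combined standard error across seeds. The scores are placeholders invented for illustration and are not results from the paper; the threshold `k = 2.0` is likewise an arbitrary assumption.

```python
import statistics

def summarize(scores: list[float]) -> tuple[float, float]:
    """Return (mean, sample standard deviation) across seeds."""
    return statistics.mean(scores), statistics.stdev(scores)

def gap_exceeds_noise(a: list[float], b: list[float], k: float = 2.0) -> bool:
    """True when the mean gap is larger than k times the combined standard error."""
    mean_a, sd_a = summarize(a)
    mean_b, sd_b = summarize(b)
    stderr = (sd_a ** 2 / len(a) + sd_b ** 2 / len(b)) ** 0.5
    return abs(mean_a - mean_b) > k * stderr

# PLACEHOLDER accuracies for three seeds per strategy; not from the paper.
filter_a = [0.50, 0.52, 0.51]
filter_b = [0.47, 0.46, 0.48]

mean_a, sd_a = summarize(filter_a)
mean_b, sd_b = summarize(filter_b)
print(f"A: {mean_a:.3f} +/- {sd_a:.3f}   B: {mean_b:.3f} +/- {sd_b:.3f}")
print("gap exceeds seed noise:", gap_exceeds_noise(filter_a, filter_b))
```

Reporting the mean and standard deviation per configuration, as in the first print line, would let readers judge whether a reported improvement is larger than run-to-run variation.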
Simulated Author's Rebuttal
We thank the referee for their constructive review and for highlighting the value of DataComp-LM as a reproducible testbed. We address the major comment below and will revise the manuscript to clarify the scope of our claims and the controlled nature of our ablations.
- Referee: [§4] §4 and associated experimental tables: The claim that 'model-based filtering is key' and directly produces the 6.6-point MMLU gain requires isolation of the filtering variable. The DCLM-Baseline differs from the MAP-Neo reference in data volume (2.6T tokens), deduplication, and mixing ratios in addition to the filter itself. A controlled contrast that fixes the underlying corpus, total token count, and all other pipeline steps while varying only the filter (heuristic vs. model-based) is needed to support causal attribution of the gains.
  Authors: We agree that a fully isolated comparison strengthens causal attribution and appreciate the referee pointing this out. Our manuscript already reports controlled experiments in §4 that fix the underlying DCLM corpus, total token count, deduplication, and mixing ratios while varying only the filtering strategy (heuristic vs. model-based). These ablations demonstrate that model-based filtering is the key driver of performance gains within our testbed. The DCLM-Baseline vs. MAP-Neo comparison is presented as an end-to-end benchmark of our full pipeline against prior open-data work rather than an isolated ablation of the filter. We will revise §4 to explicitly distinguish the internal controlled contrasts from the external benchmark comparison and to qualify that the 6.6-point MMLU gain reflects the cumulative pipeline (including but not limited to model-based filtering).
  Revision: yes
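The kind of controlled contrast discussed in this exchange can be expressed as a pair of run configurations that differ in exactly one field. The sketch below is a hypothetical illustration: `RunConfig`, its field names, and all values are invented labels, not the paper's settings or the OpenLM configuration format.

```python
from dataclasses import dataclass, asdict, replace

@dataclass(frozen=True)
class RunConfig:
    corpus: str
    train_tokens: float
    dedup: str
    mixing: str
    filter: str  # the only field allowed to vary in the ablation

def differing_fields(a: RunConfig, b: RunConfig) -> set[str]:
    """Names of the fields whose values differ between two runs."""
    da, db = asdict(a), asdict(b)
    return {k for k in da if da[k] != db[k]}

# All values are hypothetical labels for illustration.
base = RunConfig(
    corpus="shared-pool",
    train_tokens=1e11,
    dedup="on",
    mixing="web-only",
    filter="heuristic",
)
variant = replace(base, filter="model-based")

# Guard against accidental confounds before launching a comparison.
assert differing_fields(base, variant) == {"filter"}
print(differing_fields(base, variant))
```

A check like `differing_fields` makes the distinction in the rebuttal concrete: the internal ablation passes it, whereas an end-to-end comparison against an external model would differ in several fields at once.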
Circularity Check
No circularity in empirical dataset curation results
The paper introduces an empirical benchmark (DCLM) and reports performance numbers from training runs on curated data subsets. Central claims rest on direct measurements of downstream accuracy (e.g., MMLU scores) using held-out evaluations rather than any derived quantities, fitted parameters renamed as predictions, or self-citation chains that substitute for independent justification. No equations, uniqueness theorems, or ansatzes are invoked whose validity reduces to the paper's own inputs by construction; results are therefore self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Next-token prediction on filtered web text produces useful general capabilities.
Lean theorems connected to this paper
- Foundation.DimensionForcing alexander_duality_circle_linking
  Relation between the paper passage and the cited Recognition theorem: unclear
  Paper passage: "we conduct extensive experiments and find that model-based filtering is key"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 17 Pith papers
- Scratchpad Patching: Decoupling Compute from Patch Size in Byte-Level Language Models
  Scratchpad Patching decouples compute from patch size in byte-level language models by inserting entropy-triggered scratchpads to update patch context dynamically.
- Projection-Free Transformers via Gaussian Kernel Attention
  Gaussian Kernel Attention replaces learned QKV projections with a Gaussian RBF kernel on per-head token features, using 0.42x parameters and 0.49x FLOPs while showing competitive language modeling performance at depth 20.
- Expert Upcycling: Shifting the Compute-Efficient Frontier of Mixture-of-Experts
  Expert upcycling duplicates experts in an existing MoE checkpoint and continues pre-training to match fixed-size baseline performance with 32% less compute.
- Synthetic Pre-Pre-Training Improves Language Model Robustness to Noisy Pre-Training Data
  Synthetic pre-pre-training on structured data improves LLM robustness to noisy pre-training, matching baseline loss with up to 49% fewer natural tokens for a 1B model.
- ZAYA1-8B Technical Report
  ZAYA1-8B is a reasoning MoE model with 700M active parameters that matches larger models on math and coding benchmarks and reaches 91.9% on AIME'25 via Markovian RSA test-time compute.
- Bridging Generation and Training: A Systematic Review of Quality Issues in LLMs for Code
  A review of 114 studies creates taxonomies for code and data quality issues, formalizes 18 propagation mechanisms from training data defects to LLM-generated code defects, and synthesizes detection and mitigation techniques.
- NH-CROP: Robust Pricing for Governed Language Data Assets under Cost Uncertainty
  NH-CROP introduces a robust online pricing method for governed language data with uncertain costs, using a selective verification gate that improves or matches baselines without relying heavily on paid information acq...
- Compute Optimal Tokenization
  Compute-optimal language models require parameter count to scale with data bytes rather than tokens, with optimal token compression rate decreasing as compute budget grows.
- Beyond N-gram: Data-Aware X-GRAM Extraction for Efficient Embedding Parameter Scaling
  X-GRAM applies data-aware dynamic token injection with hybrid hashing and local feature extraction to achieve up to 4.4 accuracy point gains over vanilla backbones and 3.2 over retrieval baselines at 0.73B-1.15B scale...
- Expert Upcycling: Shifting the Compute-Efficient Frontier of Mixture-of-Experts
  Expert upcycling expands MoE models by duplicating experts and continuing pre-training, matching baseline performance while saving 32% GPU hours in 7B-13B experiments.
- Parcae: Scaling Laws For Stable Looped Language Models
  Parcae stabilizes looped LLMs via spectral norm constraints on injection parameters, enabling power-law scaling for training FLOPs and saturating exponential scaling at test time that improves quality over fixed-depth...
- Attention Editing: A Versatile Framework for Cross-Architecture Attention Conversion
  Attention Editing converts pre-trained LLMs to new attention architectures through layer-wise teacher-forced optimization and model-level distillation, preserving performance with efficiency gains.
- CoFrGeNet: Continued Fraction Architectures for Language Generation
  CoFrGeNet uses continued-fraction function classes to build transformer replacements that match or beat GPT-2 and Llama performance with half to two-thirds the parameters.
- Muon is Scalable for LLM Training
  Muon optimizer with weight decay and update scaling achieves ~2x efficiency over AdamW for large LLMs, shown via the Moonlight 3B/16B MoE model trained on 5.7T tokens.
- Matrix: Peer-to-Peer Multi-Agent Synthetic Data Generation Framework
  Matrix provides a peer-to-peer multi-agent system for synthetic data generation that scales to tens of thousands of workflows and delivers 2-15x higher throughput than centralized designs without quality loss.
- SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model
  SmolLM2 is a 1.7B-parameter language model that outperforms Qwen2.5-1.5B and Llama3.2-1B after overtraining on 11 trillion tokens using custom FineMath, Stack-Edu, and SmolTalk datasets in a multi-stage pipeline.
- XekRung Technical Report
  XekRung achieves state-of-the-art performance on cybersecurity benchmarks among same-scale models via tailored data synthesis and multi-stage training while retaining strong general capabilities.
Reference graph
Works this paper leans on
-
[1]
SemDeDup: Data-efficient learning at web-scale through semantic deduplication
Amro Abbas, Kushal Tirumala, Dániel Simig, Surya Ganguli, and Ari S Morcos. Semdedup: Data-efficient learning at web-scale through semantic deduplication, 2023. URL https: //arxiv.org/abs/2303.09540
work page internal anchor Pith review arXiv 2023
-
[2]
Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone
Marah Abdin, Sam Ade Jacobs, Ammar Ahmad Awan, Jyoti Aneja, Ahmed Awadallah, Hany Awadalla, Nguyen Bach, Amit Bahree, Arash Bakhtiari, Harkirat Behl, et al. Phi-3 technical report: A highly capable language model locally on your phone. ArXiv preprint, abs/2404.14219, 2024. URL https://arxiv.org/abs/2404.14219
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[3]
Amit Agarwal, Hema Swetha Koppula, Krishna P. Leela, Krishna Prasad Chitrapura, Sachin Garg, Pavan Kumar GM, Chittaranjan Haty, Anirban Roy, and Amit Sasturkar. Url normalization for de-duplication of web pages. In ACM Conference on Information and Knowledge Management, 2009. https://doi.org/10.1145/1645953.1646283
-
[4]
Introducing meta llama 3: The most capable openly available llm to date, 2024
Meta AI. Introducing meta llama 3: The most capable openly available llm to date, 2024. https://ai.meta.com/blog/meta-llama-3/
work page 2024
-
[5]
FETA: A benchmark for few-sample task transfer in open-domain dialogue
Alon Albalak, Yi-Lin Tuan, Pegah Jandaghi, Connor Pryor, Luke Yoffe, Deepak Ramachandran, Lise Getoor, Jay Pujara, and William Yang Wang. FETA: A benchmark for few-sample task transfer in open-domain dialogue. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing , pp. 10936–10953, Abu Dhabi, United Arab Emirates, 2022....
work page 2022
-
[6]
Efficient online data mixing for language model pre-training
Alon Albalak, Liangming Pan, Colin Raffel, and William Yang Wang. Efficient online data mixing for language model pre-training. ArXiv preprint, abs/2312.02406, 2023. URL https://arxiv.org/abs/2312.02406
-
[7]
Improving few-shot generalization by exploring and exploiting auxiliary data
Alon Albalak, Colin Raffel, and William Yang Wang. Improving few-shot generalization by exploring and exploiting auxiliary data. In Advances in Neural Information Processing Systems (NeurIPS), 2023. https://openreview.net/forum?id=JDnLXc4NOn
work page 2023
-
[8]
A survey on 12 data selection for language models
Alon Albalak, Yanai Elazar, Sang Michael Xie, Shayne Longpre, Nathan Lambert, Xinyi Wang, Niklas Muennighoff, Bairu Hou, Liangming Pan, Haewon Jeong, et al. A survey on 12 data selection for language models. ArXiv preprint, abs/2402.16827, 2024. URL https: //arxiv.org/abs/2402.16827
-
[9]
Santacoder: Don’t reach for the stars!
Loubna Ben Allal, Raymond Li, Denis Kocetkov, Chenghao Mou, Christopher Akiki, Carlos Munoz Ferrandis, Niklas Muennighoff, Mayank Mishra, Alex Gu, Manan Dey, et al. Santacoder: don’t reach for the stars! ArXiv preprint, abs/2301.03988, 2023. URL https://arxiv.org/abs/2301.03988
-
[10]
The Falcon Series of Open Language Models
Ebtesam Almazrouei, Hamza Alobeidli, Abdulaziz Alshamsi, Alessandro Cappelli, Ruxandra- Aimée Cojocaru, Daniel Hesslow, Julien Launay, Quentin Malartic, Daniele Mazzotta, Badreddine Noune, Baptiste Pannier, and Guilherme Penedo. The falcon series of open language models. arXiv preprint arXiv:2311.16867, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[11]
M ath QA : Towards interpretable math word problem solving with operation-based formalisms
Aida Amini, Saadia Gabriel, Shanchuan Lin, Rik Koncel-Kedziorski, Yejin Choi, and Hannaneh Hajishirzi. MathQA: Towards interpretable math word problem solving with operation-based formalisms. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and...
-
[12]
Zachary Ankner, Cody Blakeney, Kartik Sreenivasan, Max Marion, Matthew L. Leavitt, and Mansheej Paul. Perplexed by perplexity: Perplexity-based data pruning with small reference models. ArXiv preprint, abs/2405.20541, 2024. URL https://arxiv.org/abs/2405. 20541
-
[13]
Jason Ansel, Edward Yang, Horace He, Natalia Gimelshein, Animesh Jain, Michael V oznesensky, Bin Bao, David Berard, Geeta Chauhan, Anjali Chourdia, Will Constable, Alban Desmaison, Zachary DeVito, Elias Ellison, Will Feng, Jiong Gong, Michael Gschwind, Brian Hirsh, Sherlock Huang, Laurent Kirsch, Michael Lazos, Yanbo Liang, Jason Liang, Yinghai Lu, CK Luk...
work page 2024
-
[14]
Llemma: An open language model for mathematics
Zhangir Azerbayev, Hailey Schoelkopf, Keiran Paster, Marco Dos Santos, Stephen McAleer, Albert Q Jiang, Jia Deng, Stella Biderman, and Sean Welleck. Llemma: An open language model for mathematics. ArXiv preprint, abs/2310.10631, 2023. URL https://arxiv.org/ abs/2310.10631
-
[15]
Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization.ArXiv preprint, abs/1607.06450, 2016. URL https://arxiv.org/abs/1607.06450
work page internal anchor Pith review Pith/arXiv arXiv 2016
-
[16]
Hritik Bansal, Ashima Suvarna, Gantavya Bhatt, Nanyun Peng, Kai-Wei Chang, and Aditya Grover. Comparing bad apples to good oranges: Aligning large language models via joint preference optimization. arXiv preprint arXiv:2404.00530, 2024
-
[17]
Trafilatura: A web scraping library and command-line tool for text discovery and extraction
Adrien Barbaresi. Trafilatura: A web scraping library and command-line tool for text discovery and extraction. InProceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing: System Demonstrations, pp. 122–131, Online, 2021. Association for Computational ...
-
[18]
Beyond the imitation game: Quantifying and extrapolating the capabilities of language models
BIG bench authors. Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. In Transactions on Machine Learning Research (TMLR), 2023. https: //openreview.net/forum?id=uyTL5Bvosj
work page 2023
-
[19]
Bender, Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell
Emily M. Bender, Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell. On the dangers of stochastic parrots: Can language models be too big? In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, 2021
work page 2021
-
[20]
Elastic ChatNoir: Search Engine for the ClueWeb and the Common Crawl
Janek Bevendorff, Benno Stein, Matthias Hagen, and Martin Potthast. Elastic ChatNoir: Search Engine for the ClueWeb and the Common Crawl. In European Conference on Information Retrieval Research (ECIR) , 2018. https://github.com/chatnoir-eu/ chatnoir-resiliparse
work page 2018
-
[21]
FastWARC: Optimizing Large-Scale Web Archive Analytics
Janek Bevendorff, Martin Potthast, and Benno Stein. FastWARC: Optimizing Large-Scale Web Archive Analytics. In International Symposium on Open Search Technology (OSSYM),
-
[22]
https://github.com/chatnoir-eu/chatnoir-resiliparse
-
[23]
DeepSeek-AI Xiao Bi, Deli Chen, Guanting Chen, Shanhuang Chen, Damai Dai, Chengqi Deng, Honghui Ding, Kai Dong, Qiushi Du, Zhe Fu, Huazuo Gao, Kaige Gao, Wenjun Gao, Ruiqi Ge, Kang Guan, Daya Guo, Jianzhong Guo, Guangbo Hao, Zhewen Hao, Ying He, Wen-Hui Hu, Panpan Huang, Erhang Li, Guowei Li, Jiashi Li, Yao Li, Y . K. Li, Wenfeng Liang, Fangyun Lin, A. X....
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[24]
PIQA: reasoning about physical commonsense in natural language
Yonatan Bisk, Rowan Zellers, Ronan LeBras, Jianfeng Gao, and Yejin Choi. PIQA: reasoning about physical commonsense in natural language. In The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020, The Thirty-Second Innovative Applications of Artificial Intelligence Conference, IAAI 2020, The Tenth AAAI Symposium on Educational Advances in ...
work page 2020
-
[25]
URL https://aaai.org/ojs/index.php/AAAI/article/ view/6239
AAAI Press, 2020. URL https://aaai.org/ojs/index.php/AAAI/article/ view/6239
work page 2020
-
[26]
GPT- NeoX-20B: An open-source autoregressive language model
Sidney Black, Stella Biderman, Eric Hallahan, Quentin Anthony, Leo Gao, Laurence Golding, Horace He, Connor Leahy, Kyle McDonell, Jason Phang, Michael Pieler, Usvsn Sai Prashanth, Shivanshu Purohit, Laria Reynolds, Jonathan Tow, Ben Wang, and Samuel Weinbach. GPT- NeoX-20B: An open-source autoregressive language model. In Proceedings of BigScience Episode...
work page 2022
-
[27]
Burton H. Bloom. Space/time trade-offs in hash coding with allowable errors.Communications of the ACM, 1970. https://doi.org/10.1145/362686.362692
-
[28]
Space/time trade-offs in hash coding with allowable errors
Burton H Bloom. Space/time trade-offs in hash coding with allowable errors. Communications of the ACM, 13(7):422–426, 1970
work page 1970
-
[29]
Nuanced metrics for measuring unintended bias with real data for text classification
Daniel Borkan, Lucas Dixon, Jeffrey Sorensen, Nithum Thain, and Lucy Vasserman. Nuanced metrics for measuring unintended bias with real data for text classification. In Companion proceedings of the 2019 world wide web conference, pp. 491–500, 2019. 14
work page 2019
-
[30]
Color-filter: Conditional loss reduction filtering for targeted language model pre-training
David Brandfonbrener, Hanlin Zhang, Andreas Kirsch, Jonathan Richard Schwarz, and Sham M Kakade. Color-filter: Conditional loss reduction filtering for targeted language model pre-training. arXiv preprint, 2024
work page 2024
-
[31]
Andrei Z. Broder. On the resemblance and containment of documents. In Compression and Complexity of Sequences, 1997
work page 1997
- [32]
-
[33]
Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-V oss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwi...
work page 2020
-
[34]
IGLUE: A benchmark for transfer learning across modalities, tasks, and languages
Emanuele Bugliarello, Fangyu Liu, Jonas Pfeiffer, Siva Reddy, Desmond Elliott, Edoardo Maria Ponti, and Ivan Vulic. IGLUE: A benchmark for transfer learning across modalities, tasks, and languages. In Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szepesvári, Gang Niu, and Sivan Sabato (eds.),International Conference on Machine Learning, ICML 2022, ...
work page 2022
-
[35]
Human alignment of large language models through online preference optimisation
Daniele Calandriello, Daniel Guo, Remi Munos, Mark Rowland, Yunhao Tang, Bernardo Avila Pires, Pierre Harvey Richemond, Charline Le Lan, Michal Valko, Tianqi Liu, et al. Human alignment of large language models through online preference optimisation. arXiv preprint arXiv:2403.08635, 2024
-
[36]
Extracting training data from large language models
Nicholas Carlini, Florian Tramer, Eric Wallace, Matthew Jagielski, Ariel Herbert-V oss, Katherine Lee, Adam Roberts, Tom Brown, Dawn Song, Ulfar Erlingsson, et al. Extracting training data from large language models. In 30th USENIX Security Symposium (USENIX Security 21), pp. 2633–2650, 2021
work page 2021
-
[37]
Quantifying memorization across neural language models, 2023
Nicholas Carlini, Daphne Ippolito, Matthew Jagielski, Katherine Lee, Florian Tramer, and Chiyuan Zhang. Quantifying memorization across neural language models, 2023
work page 2023
-
[38]
Data- juicer: A one-stop data processing system for large language models
Daoyuan Chen, Yilun Huang, Zhijian Ma, Hesen Chen, Xuchen Pan, Ce Ge, Dawei Gao, Yuexiang Xie, Zhaoyang Liu, Jinyang Gao, Yaliang Li, Bolin Ding, and Jingren Zhou. Data- juicer: A one-stop data processing system for large language models. In Companion of the 2024 International Conference on Management of Data, SIGMOD/PODS ’24, pp. 120–134, New York, NY , ...
-
[39]
Evaluating Large Language Models Trained on Code
Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde, Jared Kaplan, Harrison Edwards, Yura Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian, Clemens Winte...
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[40]
Skill-it! a data-driven skills framework for understanding and training language models
Mayee Chen, Nicholas Roberts, Kush Bhatia, Jue WANG, Ce Zhang, Frederic Sala, and Christopher Ré. Skill-it! a data-driven skills framework for understanding and training language models. In A. Oh, T. Neumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine (eds.), Advances in Neural Information Processing Systems, volume 36, pp. 36000–36040. Curran Assoc...
work page 2023
-
[41]
PaLM: Scaling Language Modeling with Pathways
Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, Parker Schuh, Kensen Shi, Sasha Tsvyashchenko, Joshua Maynez, Abhishek Rao, Parker Barnes, Yi Tay, Noam M. Shazeer, Vinodkumar Prabhakaran, Emily Reif, Nan Du, Benton C. Hutchinson, Reiner Pope, Ja...
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[42]
Scaling Instruction-Finetuned Language Models
Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Yunxuan Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, et al. Scaling instruction-finetuned language models. ArXiv preprint, abs/2210.11416, 2022. URL https://arxiv.org/abs/ 2210.11416
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[43]
BoolQ: Exploring the surprising difficulty of natural yes/no questions
Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. BoolQ: Exploring the surprising difficulty of natural yes/no questions. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers),...
-
[44]
Elizabeth Clark, Tal August, Sofia Serrano, Nikita Haduong, Suchin Gururangan, and Noah A. Smith. All that’s ‘human’ is not gold: Evaluating human evaluation of generated text. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Lon...
work page 2021
-
[45]
Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge
Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? try arc, the ai2 reasoning challenge. ArXiv preprint, abs/1803.05457, 2018. URL https://arxiv.org/abs/1803. 05457
work page internal anchor Pith review Pith/arXiv arXiv 2018
-
[47]
URL https://arxiv.org/abs/2110.14168. 16
work page internal anchor Pith review Pith/arXiv arXiv
- [48]
-
[49]
Redpajama: an open dataset for training large language models, 2023
Together Computer. Redpajama: an open dataset for training large language models, 2023. URLhttps://github.com/togethercomputer/RedPajama-Data
work page 2023
-
[50]
Cross-lingual language model pretraining
Alexis Conneau and Guillaume Lample. Cross-lingual language model pretraining. In Hanna M. Wallach, Hugo Larochelle, Alina Beygelzimer, Florence d’Alché- Buc, Emily B. Fox, and Roman Garnett (eds.), Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8-14, 2019, Vanc...
work page 2019
-
[51]
Unicode Standard Annex #29: Unicode Text Segmentation, 2023
The Unicode Consortium. Unicode Standard Annex #29: Unicode Text Segmentation, 2023. URLhttps://www.unicode.org/reports/tr29/
work page 2023
-
[52]
Ultrafeedback: Boosting language models with high-quality feedback, 2023
Ganqu Cui, Lifan Yuan, Ning Ding, Guanming Yao, Wei Zhu, Yuan Ni, Guotong Xie, Zhiyuan Liu, and Maosong Sun. Ultrafeedback: Boosting language models with high-quality feedback, 2023
work page 2023
-
[53]
DC-BENCH: Dataset condensation benchmark
Justin Cui, Ruochen Wang, Si Si, and Cho-Jui Hsieh. DC-BENCH: Dataset condensation benchmark. In Thirty-sixth Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2022. URL https://openreview.net/forum?id=Bs8iFQ7AM6
work page 2022
-
[54]
Scaling vision transformers to 22 billion parameters
Mostafa Dehghani, Josip Djolonga, Basil Mustafa, Piotr Padlewski, Jonathan Heek, Justin Gilmer, Andreas Peter Steiner, Mathilde Caron, Robert Geirhos, Ibrahim Alabdulmohsin, et al. Scaling vision transformers to 22 billion parameters. In International Conference on Machine Learning (ICML), 2023. https://proceedings.mlr.press/v202/dehghani23a.html
work page 2023
-
[55]
Documenting large webtext corpora: A case study on the colossal clean crawled corpus
Jesse Dodge, Maarten Sap, Ana Marasovi´c, William Agnew, Gabriel Ilharco, Dirk Groeneveld, Margaret Mitchell, and Matt Gardner. Documenting large webtext corpora: A case study on the colossal clean crawled corpus. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp. 1286–1305, Online and Punta Cana, Dominican Republic,
work page 2021
-
[56]
doi: 10.18653/v1/2021.emnlp-main.98
Association for Computational Linguistics. doi: 10.18653/v1/2021.emnlp-main.98. URL https://aclanthology.org/2021.emnlp-main.98
-
[57]
Nan Du, Yanping Huang, Andrew M. Dai, Simon Tong, Dmitry Lepikhin, Yuanzhong Xu, Maxim Krikun, Yanqi Zhou, Adams Wei Yu, Orhan Firat, Barret Zoph, Liam Fedus, Maarten P. Bosma, Zongwei Zhou, Tao Wang, Yu Emma Wang, Kellie Webster, Marie Pellat, Kevin Robinson, Kathleen S. Meier-Hellstern, Toju Duke, Lucas Dixon, Kun Zhang, Quoc V. Le, Yonghui Wu, Zhifeng...
work page 2022
-
[58]
Length-Controlled AlpacaEval: A Simple Way to Debias Automatic Evaluators
Yann Dubois, Balázs Galambosi, Percy Liang, and Tatsunori B Hashimoto. Length-controlled alpacaeval: A simple way to debias automatic evaluators. arXiv preprint arXiv:2404.04475, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[61]
KTO: Model Alignment as Prospect Theoretic Optimization
Kawin Ethayarajh, Winnie Xu, Niklas Muennighoff, Dan Jurafsky, and Douwe Kiela. Kto: Model alignment as prospect theoretic optimization. ArXiv preprint, abs/2402.01306, 2024. URL https://arxiv.org/abs/2402.01306
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[62]
What’s going on with the open llm leaderboard?
Hugging Face. What’s going on with the open llm leaderboard? https://huggingface.co/blog/open-llm-leaderboard-mmlu, 2023
work page 2023
-
[63]
Doge: Domain reweighting with generalization estimation
Simin Fan, Matteo Pagliardini, and Martin Jaggi. Doge: Domain reweighting with generalization estimation. ArXiv preprint, abs/2310.15393, 2023. URL https://arxiv.org/abs/2310.15393
-
[64]
Alex Fang, Albin Madappally Jose, Amit Jain, Ludwig Schmidt, Alexander Toshev, and Vaishaal Shankar. Data filtering networks. ArXiv preprint, abs/2309.17425, 2023. URL https://arxiv.org/abs/2309.17425
-
[65]
Lighteval: A lightweight framework for llm evaluation, 2023
Clémentine Fourrier, Nathan Habib, Thomas Wolf, and Lewis Tunstall. Lighteval: A lightweight framework for llm evaluation, 2023. URL https://github.com/huggingface/lighteval
work page 2023
-
[66]
Datacomp: In search of the next generation of multimodal datasets
Samir Yitzhak Gadre, Gabriel Ilharco, Alex Fang, Jonathan Hayase, Georgios Smyrnis, Thao Nguyen, Ryan Marten, Mitchell Wortsman, Dhruba Ghosh, Jieyu Zhang, et al. Datacomp: In search of the next generation of multimodal datasets. In Advances in Neural Information Processing Systems (NeurIPS), volume 36, 2024. https://arxiv.org/abs/2304.14108
-
[67]
Samir Yitzhak Gadre, Georgios Smyrnis, Vaishaal Shankar, Suchin Gururangan, Mitchell Wortsman, Rulin Shao, Jean Mercat, Alex Fang, Jeffrey Li, Sedrick Keh, Rui Xin, Marianna Nezhurina, Igor Vasiljevic, Jenia Jitsev, Alexandros G. Dimakis, Gabriel Ilharco, Shuran Song, Thomas Kollar, Yair Carmon, Achal Dave, Reinhard Heckel, Niklas Muennighoff, and Ludwig ...
-
[69]
URL https://arxiv.org/abs/2101.00027
work page internal anchor Pith review Pith/arXiv arXiv
-
[70]
Data mixing made efficient: A bivariate scaling law for language model pretraining
Ce Ge, Zhijian Ma, Daoyuan Chen, Yaliang Li, and Bolin Ding. Data mixing made efficient: A bivariate scaling law for language model pretraining. ArXiv preprint, abs/2405.14908, 2024. URL https://arxiv.org/abs/2405.14908
-
[71]
Realtoxicityprompts: Evaluating neural toxic degeneration in language models
Samuel Gehman, Suchin Gururangan, Maarten Sap, Yejin Choi, and Noah A Smith. Realtoxicityprompts: Evaluating neural toxic degeneration in language models. Findings of the Association for Computational Linguistics: EMNLP 2020, 2020
work page 2020
-
[72]
Mor Geva, Yoav Goldberg, and Jonathan Berant. Are we modeling the task or the annotator? an investigation of annotator bias in natural language understanding datasets. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 1161...
-
[73]
Did aristotle use a laptop? a question answering benchmark with implicit reasoning strategies
Mor Geva, Daniel Khashabi, Elad Segal, Tushar Khot, Dan Roth, and Jonathan Berant. Did aristotle use a laptop? a question answering benchmark with implicit reasoning strategies. Transactions of the Association for Computational Linguistics, 9:346–361, 2021. doi: 10.1162/tacl_a_00370. URL https://aclanthology.org/2021.tacl-1.21
work page 2021
-
[74]
Non-expert evaluation of summarization systems is risky
Dan Gillick and Yang Liu. Non-expert evaluation of summarization systems is risky. In Proceedings of the NAACL HLT 2010 Workshop on Creating Speech and Language Data with Amazon’s Mechanical Turk, pp. 148–151, Los Angeles, 2010. Association for Computational Linguistics. URL https://aclanthology.org/W10-0722
work page 2010
-
[75]
Zamba: A compact 7b ssm hybrid model
Paolo Glorioso, Quentin Anthony, Yury Tokpanov, James Whittington, Jonathan Pilault, Adam Ibrahim, and Beren Millidge. Zamba: A compact 7b ssm hybrid model. ArXiv preprint, abs/2405.16712, 2024. URL https://arxiv.org/abs/2405.16712
-
[76]
Aaron Gokaslan, Vanya Cohen, Ellie Pavlick, and Stefanie Tellex. Openwebtext corpus, 2019. http://Skylion007.github.io/OpenWebTextCorpus
work page 2019
-
[77]
Learning word vectors for 157 languages
Edouard Grave, Piotr Bojanowski, Prakhar Gupta, Armand Joulin, and Tomas Mikolov. Learning word vectors for 157 languages. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan, 2018. European Language Resources Association (ELRA). URL https://aclanthology.org/L18-1550
work page 2018
-
[78]
Dirk Groeneveld. The big friendly filter. https://github.com/allenai/bff, 2023
work page 2023
-
[79]
OLMo: Accelerating the Science of Language Models
Dirk Groeneveld, Iz Beltagy, Pete Walsh, Akshita Bhagia, Rodney Kinney, Oyvind Tafjord, Ananya Harsh Jha, Hamish Ivison, Ian Magnusson, Yizhong Wang, et al. Olmo: Accelerating the science of language models. ArXiv preprint, abs/2402.00838, 2024. URL https://arxiv.org/abs/2402.00838
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[80]
Mamba: Linear-Time Sequence Modeling with Selective State Spaces
Albert Gu and Tri Dao. Mamba: Linear-time sequence modeling with selective state spaces. ArXiv preprint, abs/2312.00752, 2023. URL https://arxiv.org/abs/2312.00752
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[81]
Suriya Gunasekar, Yi Zhang, Jyoti Aneja, Caio Cesar, Teodoro Mendes, Allie Del Giorno, Sivakanth Gopi, Mojan Javaheripi, Piero Kauffmann, Gustavo de Rosa, Olli Saarikivi, Adil Salim, Shital Shah, Harkirat Singh Behl, Xin Wang, Sébastien Bubeck, Ronen Eldan, Adam Tauman Kalai, Yin Tat Lee, and Yuanzhi Li. Textbooks are all you need. Preprint, 2023. https:/...
work page 2023
-
[82]
OpenLM: a minimal but performative language modeling (lm) repository, 2023
Suchin Gururangan, Mitchell Wortsman, Samir Yitzhak Gadre, Achal Dave, Maciej Kilian, Weijia Shi, Jean Mercat, Georgios Smyrnis, Gabriel Ilharco, Matt Jordan, Reinhard Heckel, Alex Dimakis, Ali Farhadi, Vaishaal Shankar, and Ludwig Schmidt. OpenLM: a minimal but performative language modeling (lm) repository, 2023. https://github.com/mlfoundations/open_lm
work page 2023
-
[84]
Measuring massive multitask language understanding
Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenReview.net, 2021. URL https://openreview.net/forum?id=d7KBjmI3GmQ
work page 2021