arxiv: 2605.12715 · v2 · pith:A55W7NL4new · submitted 2026-05-12 · 💻 cs.LG · cs.CL

Scaling Laws for Mixture Pretraining Under Data Constraints

Pith reviewed 2026-05-19 16:32 UTC · model grok-4.3

classification 💻 cs.LG cs.CL
keywords: scaling laws, mixture pretraining, data repetition, language models, data constraints, target domain performance, pretraining optimization, generic data regularization

The pith

In mixtures with generic data, scarce target data can be reused 15-20 times, far more than single-source training tolerates, and a new repetition-aware scaling law turns that observation into mixture recommendations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Language models require growing amounts of data as they scale, but many valuable target sources, such as low-resource languages or specialized domains, remain limited in size. The paper investigates the trade-off of mixing this scarce target data with abundant generic data and finds that repetition of the target examples is a central driver of final performance on the target domain. Mixtures tolerate substantially higher repetition than training on the target data alone; the paper attributes this to the regularizing role of generic data. The authors introduce a scaling law that explicitly models the decreasing value of each additional repetition of target tokens and the regularizing effect of the generic portion. Optimizing this law yields concrete recommendations for how much target data to include and how many times to reuse it, given the target data size, compute budget, and model scale.

Core claim

The central claim, supported by more than 2,000 training runs, has three parts. First, repetition of target tokens is a central driver of target-domain performance in mixture pretraining. Second, mixtures can reuse scarce target corpora 15-20 times, far more than single-source training tolerates. Third, a repetition-aware scaling law that captures the decreasing value of repeated target tokens together with the regularizing role of generic data can be optimized to compute effective mixture configurations from target data size, compute budget, and model scale.
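To make the quantity at stake concrete: if a fraction of the training budget is drawn from a target corpus with a fixed number of unique tokens, the average number of passes over that corpus is the fraction times the budget divided by the corpus size. The sketch below is plain arithmetic, not code from the paper; the function names and the example numbers are hypothetical.

```python
def target_repetitions(total_tokens: float, target_fraction: float,
                       unique_target_tokens: float) -> float:
    """Average number of passes over the target corpus during training."""
    if not 0.0 <= target_fraction <= 1.0:
        raise ValueError("target_fraction must lie in [0, 1]")
    if total_tokens <= 0 or unique_target_tokens <= 0:
        raise ValueError("token counts must be positive")
    return target_fraction * total_tokens / unique_target_tokens


def fraction_for_repetitions(total_tokens: float, repetitions: float,
                             unique_target_tokens: float) -> float:
    """Mixture fraction needed to see the target corpus `repetitions` times."""
    if total_tokens <= 0 or unique_target_tokens <= 0 or repetitions < 0:
        raise ValueError("token counts must be positive, repetitions non-negative")
    fraction = repetitions * unique_target_tokens / total_tokens
    if fraction > 1.0:
        raise ValueError("budget is too small for that many repetitions")
    return fraction


# Hypothetical numbers: 100B training tokens, 1B unique target tokens.
print(target_repetitions(100e9, 0.15, 1e9))      # passes at a 15% target share
print(fraction_for_repetitions(100e9, 20, 1e9))  # share needed for 20 passes
```

Under these made-up numbers, the 15-20 repetition range quoted above corresponds to a target share of roughly 15 to 20 percent of the budget; with a different budget or corpus size the same repetition range maps to a different share.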

What carries the argument

The repetition-aware mixture scaling law, which extends standard scaling relations with a term for the decreasing value of each repeated target token and a term for the regularizing contribution of generic data.
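This summary does not reproduce the law itself. For orientation only, one established way to model the decreasing value of repeated tokens is the effective-data form from prior work on data-constrained scaling [28]. The block below states that prior form, written for a corpus of $U$ unique tokens; it is not the equation proposed in this paper, and how the paper's law handles the generic-data term is not stated here.

```latex
% Effective-data form from prior work on data-constrained scaling [28].
% NOT the law proposed in the paper under review.
%   U    : number of unique tokens in the corpus
%   R    : number of repetitions beyond the first pass
%   R^*  : fitted constant controlling how quickly repeated tokens lose value
%   D'   : effective number of tokens after accounting for repetition
D' = U + U \, R^{*} \left( 1 - e^{-R / R^{*}} \right)
% Small R:  D' \approx U (1 + R), so repeated tokens count almost like fresh ones.
% Large R:  D' \to U (1 + R^{*}), so further repetitions add almost nothing.
```

The saturating exponential is what encodes diminishing returns: each extra pass contributes less than the previous one, and the total contribution of repetition is bounded.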

If this is right

  • Mixture ratios can be chosen by solving the scaling law rather than by running many separate training experiments.
  • Target data repetition can be set higher when generic data is present, reducing the total volume of unique target tokens needed.
  • Optimal repetition levels rise with model scale and compute budget but fall as the absolute size of the target corpus increases.
  • The same law applies across multilingual, domain-specific, and quality-filtered target sources.
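The first consequence above, choosing a mixture ratio by solving a fitted law instead of running more experiments, can be sketched as a one-dimensional search. Everything in the block below is a stand-in: `toy_target_loss` is a hypothetical placeholder objective with made-up constants, built from the saturating effective-data form of prior work [28] plus a separate generic-data term. It is not the paper's law and its output carries no empirical meaning.

```python
import math


def toy_target_loss(p: float, total_tokens: float, unique_target: float,
                    r_star: float = 15.0, a: float = 400.0, alpha: float = 0.3,
                    b: float = 50.0, beta: float = 0.3, floor: float = 1.8) -> float:
    """Hypothetical target-domain loss at target fraction p (all constants made up)."""
    seen = p * total_tokens
    if seen <= unique_target:
        # Less than one full pass: every target token seen is fresh.
        effective = seen
    else:
        # Repeated passes lose value according to a saturating exponential.
        extra_passes = seen / unique_target - 1.0
        effective = unique_target * (
            1.0 + r_star * (1.0 - math.exp(-extra_passes / r_star)))
    generic = (1.0 - p) * total_tokens
    return floor + a / effective ** alpha + b / generic ** beta


def best_target_fraction(total_tokens: float, unique_target: float,
                         steps: int = 999) -> float:
    """Grid search over fractions strictly inside (0, 1)."""
    candidates = [(i + 1) / (steps + 1) for i in range(steps)]
    return min(candidates,
               key=lambda p: toy_target_loss(p, total_tokens, unique_target))


# Hypothetical setting: 100B-token budget, 1B unique target tokens.
p_star = best_target_fraction(100e9, 1e9)
print(p_star, p_star * 100e9 / 1e9)  # chosen fraction and implied repetitions
```

A grid search is used here only because it is easy to verify; with a real fitted law, any scalar optimizer over the mixture fraction would serve the same purpose.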

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The framework could be adapted to decide how much synthetic target data to generate when real data is exhausted.
  • Similar repetition-aware laws might apply to continued pretraining or instruction-tuning stages where domain data is also limited.
  • Hardware-aware versions of the law could incorporate per-token latency differences between target and generic batches.

Load-bearing premise

The repetition tolerance and scaling behavior observed across the tested model sizes and data types will continue to hold when the same approach is applied at larger scales or with different data distributions.

What would settle it

A controlled experiment at substantially larger model scale or with a new data type in which the measured optimal repetition count for a given target size and compute budget deviates by more than 30 percent from the value predicted by the scaling law.
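The 30 percent criterion above amounts to a relative-deviation check between a measured and a predicted optimal repetition count. A minimal sketch follows, with hypothetical function names; the threshold is the one stated in the text, and measuring the deviation relative to the predicted value is an assumption of this sketch.

```python
def relative_deviation(measured: float, predicted: float) -> float:
    """Absolute deviation of the measurement, relative to the prediction."""
    if predicted == 0:
        raise ValueError("predicted value must be non-zero")
    return abs(measured - predicted) / abs(predicted)


def contradicts_law(measured_reps: float, predicted_reps: float,
                    tolerance: float = 0.30) -> bool:
    """True when the measured optimum misses the prediction by more than `tolerance`."""
    return relative_deviation(measured_reps, predicted_reps) > tolerance


# Hypothetical values: the law predicts 20 repetitions, an experiment finds 30.
print(contradicts_law(30, 20))
```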

Original abstract

As language models scale, the amount of data they require grows -- yet many target data sources, such as low-resource languages or specialized domains, are inherently limited in size. A common strategy is to mix this scarce but valuable target data with abundant generic data, which presents a fundamental trade-off: too little target data in the mixture underexposes the model to the target domain, while too much target data repeats the same examples excessively, yielding diminishing returns and eventual overfitting. We study this trade-off across more than 2,000 language-model training runs spanning multiple model and target dataset sizes, as well as several data types, including multilingual, domain-specific, and quality-filtered mixtures. Across all settings, we find that repetition is a central driver of target-domain performance, and that mixture training tolerates much higher repetition than single-source training: scarce target corpora can be reused 15-20 times, with the optimal number of repetitions depending on the target data size, compute budget, and model scale. Next, we introduce a repetition-aware mixture scaling law that accounts for the decreasing value of repeated target tokens and the regularizing role of generic data. Optimizing the scaling law provides a principled way to compute effective mixture configurations, yielding practical mixture recommendations for pretraining under data constraints.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper examines the trade-off in pretraining language models when mixing scarce target-domain data (e.g., low-resource languages or specialized domains) with abundant generic data. Across more than 2,000 training runs spanning model sizes, target dataset sizes, and data types (multilingual, domain-specific, quality-filtered), it reports that repetition of target tokens is a central driver of performance and that mixture training tolerates substantially higher repetition (optimal 15-20x) than single-source training, with the optimum depending on target size, compute budget, and model scale. It then introduces a repetition-aware mixture scaling law that models the decreasing value of repeated target tokens together with the regularizing effect of generic data; optimizing this law is claimed to yield principled mixture recommendations under data constraints.

Significance. If the scaling law is shown to be predictive rather than a post-hoc fit, the work would offer both empirical guidance and a practical tool for pretraining under realistic data scarcity, a common constraint in multilingual and domain-specific settings. The scale of the experimental campaign (>2000 runs) provides a solid empirical foundation for the repetition-tolerance claims and the dependence on target size, compute, and model scale.

major comments (2)
  1. [Scaling law section (post-experiments)] The repetition-aware mixture scaling law is introduced after the experimental results are presented. It is unclear whether its functional form and coefficients were derived independently (e.g., from first principles or a separate theoretical argument) or selected/tuned to match the measured performance curves from the same >2000 runs. If the latter, the claim that the law provides a 'principled way to compute effective mixture configurations' reduces to a descriptive fit whose extrapolation to new regimes or held-out repetition ratios remains untested.
  2. [Experimental results and figures] No error bars, confidence intervals, or exclusion criteria are reported for the performance measurements across the 2000+ runs. This makes it difficult to assess the statistical reliability of the reported optimal repetition counts (15-20x) and the dependence on target size, compute, and scale.
minor comments (2)
  1. [Abstract and introduction] The abstract and main text would benefit from an explicit statement of how the scaling-law parameters were obtained (e.g., least-squares fit on which subset of runs, or closed-form derivation).
  2. [Scaling law definition] Notation for the scaling law (e.g., symbols for repetition factor, effective tokens, regularization term) should be introduced with a clear table or equation reference to avoid ambiguity when the law is later optimized.
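The second major comment asks for error bars. For a configuration repeated with several random seeds, the usual summary is the mean, the sample standard deviation, and the standard error of the mean. The sketch below uses only the Python standard library; the loss values are made-up placeholders, not results from the paper.

```python
import math
import statistics


def summarize_seeds(losses):
    """Return (mean, sample standard deviation, standard error) across seeds."""
    if len(losses) < 2:
        raise ValueError("need at least two seeds to estimate variability")
    mean = statistics.mean(losses)
    std = statistics.stdev(losses)       # sample std, divides by n - 1
    sem = std / math.sqrt(len(losses))   # standard error of the mean
    return mean, std, sem


# Placeholder losses for one configuration trained with three seeds.
mean, std, sem = summarize_seeds([2.29, 2.31, 2.33])
print(f"{mean:.3f} +/- {sem:.3f} (std {std:.3f})")
```

Comparing the gap between neighboring repetition counts against this seed-to-seed spread is one way to judge whether a reported optimum is distinguishable from its neighbors.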

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for the constructive feedback on our manuscript. We address each major comment below and describe the revisions we plan to make.

Point-by-point responses
  1. Referee: [Scaling law section (post-experiments)] The repetition-aware mixture scaling law is introduced after the experimental results are presented. It is unclear whether its functional form and coefficients were derived independently (e.g., from first principles or a separate theoretical argument) or selected/tuned to match the measured performance curves from the same >2000 runs. If the latter, the claim that the law provides a 'principled way to compute effective mixture configurations' reduces to a descriptive fit whose extrapolation to new regimes or held-out repetition ratios remains untested.

    Authors: The functional form is motivated by established scaling-law structures for diminishing returns under data repetition (building on prior work such as Hoffmann et al.) together with an explicit term for the regularizing contribution of generic data. Coefficients were obtained by fitting to the full set of runs. We acknowledge that this renders the law primarily empirical rather than derived from first principles. In the revision we will clarify this distinction in the text, add a dedicated subsection on model derivation, and report explicit held-out validation: we will reserve a subset of repetition ratios and target-size/compute combinations, refit on the remainder, and demonstrate that the law still predicts optimal mixtures with low error on the held-out points. revision: yes

  2. Referee: [Experimental results and figures] No error bars, confidence intervals, or exclusion criteria are reported for the performance measurements across the 2000+ runs. This makes it difficult to assess the statistical reliability of the reported optimal repetition counts (15-20x) and the dependence on target size, compute, and scale.

    Authors: We agree that error bars and explicit exclusion criteria would strengthen statistical assessment. Because of the computational cost of more than 2,000 full training runs, we did not repeat every configuration with independent random seeds. Nevertheless, the reported optimal repetition range (15-20x) and its dependence on target size, compute, and scale emerge consistently across multiple data regimes (multilingual, domain-specific, quality-filtered). In the revised manuscript we will (i) state the exclusion criteria used for outlier runs, (ii) add error bars to the primary figures based on repeated-seed subsets for representative model and data sizes, and (iii) include a short variability analysis quantifying the standard deviation observed across those repeats. revision: yes
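The held-out validation promised in the first response follows a standard pattern: reserve some configurations, refit on the rest, and report prediction error on the reserved points. The sketch below shows that pattern with a plain power law fitted in log-log space as a stand-in for the paper's law, which is not reproduced in this summary. The data are synthetic and generated from a known power law purely to exercise the code; all names are hypothetical.

```python
import math


def fit_power_law(xs, ys):
    """Fit y = c * x**k by ordinary least squares on (log x, log y)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mean_x, mean_y = sum(lx) / n, sum(ly) / n
    sxx = sum((u - mean_x) ** 2 for u in lx)
    sxy = sum((u - mean_x) * (v - mean_y) for u, v in zip(lx, ly))
    k = sxy / sxx
    c = math.exp(mean_y - k * mean_x)
    return c, k


def held_out_errors(xs, ys, held_out_indices):
    """Refit without the held-out points, then return their relative errors."""
    held = set(held_out_indices)
    train = [(x, y) for i, (x, y) in enumerate(zip(xs, ys)) if i not in held]
    c, k = fit_power_law([x for x, _ in train], [y for _, y in train])
    return [abs(c * xs[i] ** k - ys[i]) / ys[i] for i in sorted(held)]


# Synthetic data from y = 3 * x**-0.2; the two largest x values are held out,
# so the check is an extrapolation rather than an interpolation.
xs = [1, 2, 4, 8, 16, 32]
ys = [3.0 * x ** -0.2 for x in xs]
print(held_out_errors(xs, ys, [4, 5]))
```

Holding out the largest values, not a random subset, is the stricter test, because the referee's concern is extrapolation to regimes the fit has not seen.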

Circularity Check

1 step flagged

Repetition-aware scaling law fitted to experimental runs rather than independently derived from first principles

specific steps
  1. fitted input called prediction [Abstract]
    "Next, we introduce a repetition-aware mixture scaling law that accounts for the decreasing value of repeated target tokens and the regularizing role of generic data. Optimizing the scaling law provides a principled way to compute effective mixture configurations, yielding practical mixture recommendations for pretraining under data constraints."

    The law is introduced after describing the >2000 runs across model sizes and data types. Its parameters are selected or tuned so that the law reproduces the measured target-domain performance as a function of repetition and mixture ratio; the subsequent 'optimization' therefore outputs recommendations that are statistically forced by the fit rather than predicted from an external derivation.

full rationale

The paper conducts >2000 runs, then introduces a scaling law whose functional form accounts for repetition value and generic-data regularization. Optimizing this law is presented as yielding principled mixture recommendations. Because the law is introduced after the runs and its parameters are chosen to match the observed performance curves (as implied by the experimental scale and the claim of 'accounting for' the patterns), the recommendations reduce to a descriptive fit on the same data rather than an independent prediction. This is a moderate instance of fitted-input-called-prediction; the central claim still contains empirical content but the 'principled' status is not independently validated.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The scaling law relies on assumptions about repetition value decay and generic data regularization, with optimal repetition counts likely determined empirically from the runs.

free parameters (1)
  • optimal repetition count
    Observed range of 15-20 repetitions fitted or selected based on target data size, compute, and model scale in experiments.
axioms (1)
  • domain assumption Generic data provides regularization that mitigates overfitting from repeated target data.
    Invoked to explain why mixtures tolerate higher repetition than single-source training.

pith-pipeline@v0.9.0 · 5760 in / 1363 out tokens · 53261 ms · 2026-05-19T16:32:10.464199+00:00 · methodology


Reference graph

Works this paper leans on

55 extracted references · 55 canonical work pages · 9 internal anchors

  1. [1]

    Parameters vs flops: Scaling laws for optimal sparsity for mixture-of-experts language models

    Samira Abnar, Harshay Shah, Dan Busbridge, Alaaeldin El-Nouby, Joshua M Susskind, and Vimal Thilak. Parameters vs flops: Scaling laws for optimal sparsity for mixture-of-experts language models. In International Conference on Machine Learning, pages 204--230. PMLR, 2025

  2. [2]

    Mix, don't tune: Bilingual pre-training outperforms hyperparameter search in data-constrained settings

    Anonymous. Mix, don't tune: Bilingual pre-training outperforms hyperparameter search in data-constrained settings. Submitted to NeurIPS 2026, 2026

  3. [3]

    Introducing claude opus 4.6

    Anthropic. Introducing claude opus 4.6. Anthropic Announcements, 2026. URL https://www.anthropic.com/news/claude-opus-4-6

  4. [4]

    SmolLM3: smol, multilingual, long-context reasoner

    Elie Bakouch, Loubna Ben Allal, Anton Lozhkov, Nouamane Tazi, Lewis Tunstall, Carlos Miguel Patiño, Edward Beeching, Aymeric Roucher, Aksel Joonas Reedi, Quentin Gallouédec, Kashif Rasul, Nathan Habib, Clémentine Fourrier, Hynek Kydlicek, Guilherme Penedo, Hugo Larcher, Mathieu Morlon, Vaibhav Srivastav, Joshua Lochner, Xuan-Son Nguyen, Colin Raffel, Lean...

  5. [5]

    Scaling laws for forgetting during finetuning with pretraining data injection

    Louis Béthune, David Grangier, Dan Busbridge, Eleonora Gualdoni, Marco Cuturi, and Pierre Ablin. Scaling laws for forgetting during finetuning with pretraining data injection. In Forty-second International Conference on Machine Learning, 2025. URL https://openreview.net/forum?id=vWMij23BmQ

  6. [6]

    Piqa: Reasoning about physical commonsense in natural language

    Yonatan Bisk, Rowan Zellers, Ronan Le Bras, Jianfeng Gao, and Yejin Choi. Piqa: Reasoning about physical commonsense in natural language. In Thirty-Fourth AAAI Conference on Artificial Intelligence, 2020

  7. [7]

    Scaling parameter-constrained language models with quality data

    Ernie Chang, Matteo Paltenghi, Yang Li, Pin-Jie Lin, Changsheng Zhao, Patrick Huber, Zechun Liu, Rastislav Rabatin, Yangyang Shi, and Vikas Chandra. Scaling parameter-constrained language models with quality data. In Franck Dernoncourt, Daniel Preoţiuc-Pietro, and Anastasia Shimorina, editors, Proceedings of the 2024 Conference on Empirical Methods in N...

  8. [8]

    Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge

    Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? try arc, the ai2 reasoning challenge. arXiv preprint arXiv:1803.05457, 2018

  9. [9]

    Nemotron-climb: Clustering-based iterative data mixture bootstrapping for language model pre-training

    Shizhe Diao, Yu Yang, Yonggan Fu, Xin Dong, Dan Su, Markus Kliegl, Zijia Chen, Peter Belcak, Yoshi Suhara, Hongxu Yin, Mostofa Patwary, Yingyan Lin, Jan Kautz, and Pavlo Molchanov. Nemotron-climb: Clustering-based iterative data mixture bootstrapping for language model pre-training, 2025. URL https://arxiv.org/abs/2504.13161

  10. [10]

    Essential-web v1.0: 24t tokens of organized web data, 2025

    Essential AI: Andrew Hojel, Michael Pust, Tim Romanski, Yash Vanjani, Ritvik Kapila, Mohit Parmar, Adarsh Chaluvaraju, Alok Tripathy, Anil Thomas, Ashish Tanwer, Darsh J Shah, Ishaan Shah, Karl Stratos, Khoi Nguyen, Kurt Smith, Michael Callahan, Peter Rushton, Philip Monk, Platon Mazarakis, Saad Jamal, Saurabh Srivastava, Somanshu Singla, and Ashish V...

  11. [11]

    Doge: domain reweighting with generalization estimation

    Simin Fan, Matteo Pagliardini, and Martin Jaggi. Doge: domain reweighting with generalization estimation. In Proceedings of the 41st International Conference on Machine Learning, ICML'24. JMLR.org, 2024

  12. [12]

    The Pile: An 800GB Dataset of Diverse Text for Language Modeling

    Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, Shawn Presser, and Connor Leahy. The Pile: An 800gb dataset of diverse text for language modeling. arXiv preprint arXiv:2101.00027, 2020

  13. [13]

    Scaling laws for data filtering -- data curation cannot be compute agnostic

    Sachin Goyal, Pratyush Maini, Zachary C. Lipton, Aditi Raghunathan, and J. Zico Kolter. Scaling laws for data filtering—data curation cannot be compute agnostic. In 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 22702--22711, 2024. doi:10.1109/CVPR52733.2024.02142

  14. [14]

    Task-adaptive pretrained language models via clustered-importance sampling

    David Grangier, Simin Fan, Skyler Seto, and Pierre Ablin. Task-adaptive pretrained language models via clustered-importance sampling. In ICLR, 2025

  15. [15]

    Textbooks are all you need

    Suriya Gunasekar, Yi Zhang, Jyoti Aneja, Caio Cesar, Teodoro Mendes, Allie Del Giorno, Sivakanth Gopi, Mojan Javaheripi, Piero Kauffmann, Gustavo de Rosa, Olli Saarikivi, Adil Salim, Shital Shah, Harkirat Singh Behl, Xin Wang, Sébastien Bubeck, Ronen Eldan, Adam Tauman Kalai, Yin Tat Lee, and Yuanzhi Li. Textbooks are all you need, June 2023. URL https://...

  16. [16]

    Scaling laws and compute-optimal training beyond fixed training durations

    Alexander Hägele, Elie Bakouch, Atli Kosson, Loubna Ben Allal, Leandro Von Werra, and Martin Jaggi. Scaling laws and compute-optimal training beyond fixed training durations. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. URL https://openreview.net/forum?id=Y13gSfTjGr

  17. [17]

    Scaling Laws and Interpretability of Learning from Repeated Data

    Danny Hernandez, Tom Brown, Tom Conerly, Nova DasSarma, Dawn Drain, Sheer El-Showk, Nelson Elhage, Zac Hatfield-Dodds, Tom Henighan, Tristan Hume, Scott Johnston, Ben Mann, Chris Olah, Catherine Olsson, Dario Amodei, Nicholas Joseph, Jared Kaplan, and Sam McCandlish. Scaling laws and interpretability of learning from repeated data, 2022. URL https://arxiv...

  18. [18]

    Training Compute-Optimal Large Language Models

    Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, Tom Hennigan, Eric Noland, Katie Millican, George van den Driessche, Bogdan Damoc, Aurelia Guy, Simon Osindero, Karen Simonyan, Erich Elsen, Oriol Vinyals, Jack W. Rae, and Laurent Sifre...

  19. [19]

    The State and Fate of Linguistic Diversity and Inclusion in the NLP World

    Pratik Joshi, Sebastin Santy, Amar Budhiraja, Kalika Bali, and Monojit Choudhury. The state and fate of linguistic diversity and inclusion in the NLP world. In Dan Jurafsky, Joyce Chai, Natalie Schluter, and Joel Tetreault, editors, Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 6282--6293, Online, July 2020...

  20. [20]

    Scaling Laws for Neural Language Models

    Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020

  21. [21]

    Adam: A Method for Stochastic Optimization

    Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014

  22. [22]

    Scaling laws for fine-grained mixture of experts

    Jakub Krajewski, Jan Ludziejewski, Kamil Adamczewski, Maciej Pióro, Michał Krutul, Szymon Antoniak, Kamil Ciebiera, Krystian Król, Tomasz Odrzygóźdź, Piotr Sankowski, Marek Cygan, and Sebastian Jaszczur. Scaling laws for fine-grained mixture of experts, 2024. URL https://arxiv.org/abs/2402.07871

  23. [23]

    Sentencepiece: A simple and language independent subword tokenizer and detokenizer for neural text processing

    Taku Kudo and John Richardson. Sentencepiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 66--71, 2018

  24. [24]

    DataComp-LM: In search of the next generation of training sets for language models

    Jeffrey Li, Alex Fang, Georgios Smyrnis, Maor Ivgi, Matt Jordan, Samir Gadre, Hritik Bansal, Etash Guha, Sedrick Keh, Kushal Arora, Saurabh Garg, Rui Xin, Niklas Muennighoff, Reinhard Heckel, Jean Mercat, Mayee Chen, Suchin Gururangan, Mitchell Wortsman, Alon Albalak, Yonatan Bitton, Marianna Nezhurina, Amro Abbas, Cheng-Yu Hsieh, Dhruba Ghosh, Josh Gardn...

  25. [25]

    From acceleration to saturation: Scaling behavior of bootstrapped language model pretraining

    Seng Pei Liew and Takuya Kato. From acceleration to saturation: Scaling behavior of bootstrapped language model pretraining. In NeurIPS 2025 Workshop on Evaluating the Evolving LLM Lifecycle: Benchmarks, Emergent Abilities, and Scaling, 2025. URL https://openreview.net/forum?id=PhsneSYvWK

  26. [26]

    Advanced version of gemini with deep think officially achieves gold-medal standard at the international mathematical olympiad

    Thang Luong and Edward Lockhart. Advanced version of gemini with deep think officially achieves gold-medal standard at the international mathematical olympiad. Google DeepMind Blog, 1, 2025

  27. [27]

    Rephrasing the web: A recipe for compute and data-efficient language modeling

    Pratyush Maini, Skyler Seto, Richard Bai, David Grangier, Yizhe Zhang, and Navdeep Jaitly. Rephrasing the web: A recipe for compute and data-efficient language modeling. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 14044--14072, 2024

  28. [28]

    Scaling data-constrained language models

    Niklas Muennighoff, Alexander M Rush, Boaz Barak, Teven Le Scao, Nouamane Tazi, Aleksandra Piktus, Sampo Pyysalo, Thomas Wolf, and Colin Raffel. Scaling data-constrained language models. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. URL https://openreview.net/forum?id=j5BuTrEj35

  29. [29]

Team OLMo: Allyson Ettinger, Amanda Bertsch, Bailey Kuehl, David Graham, David Heineman, Dirk Groeneveld, Faeze Brahman, Finbarr Timbers, Hamish Ivison, Jacob Morrison, Jake Poznanski, Kyle Lo, Luca Soldaini, Matt Jordan, Mayee Chen, Michael Noukhovitch, Nathan Lambert, Pete Walsh, Pradeep Dasigi, Robert Berry, Saumya Malik, Saurabh Shah, Scott Geng, S...

  30. [30]

    Team OLMo, Pete Walsh, Luca Soldaini, Dirk Groeneveld, Kyle Lo, Shane Arora, Akshita Bhagia, Yuling Gu, Shengyi Huang, Matt Jordan, Nathan Lambert, Dustin Schwenk, Oyvind Tafjord, Taira Anderson, David Atkinson, Faeze Brahman, Christopher Clark, Pradeep Dasigi, Nouha Dziri, Allyson Ettinger, Michal Guerquin, David Heineman, Hamish Ivison, Pang Wei Koh, Ji...

  31. [31]

    The LAMBADA dataset: Word prediction requiring a broad discourse context

    Denis Paperno, Germán Kruszewski, Angeliki Lazaridou, Ngoc Quan Pham, Raffaella Bernardi, Sandro Pezzelle, Marco Baroni, Gemma Boleda, and Raquel Fernández. The LAMBADA dataset: Word prediction requiring a broad discourse context. In Katrin Erk and Noah A. Smith, editors, Proceedings of the 54th Annual Meeting of the Association for Computational ...

  32. [32]

    OpenWebMath: An open dataset of high-quality mathematical web text

    Keiran Paster, Marco Dos Santos, Zhangir Azerbayev, and Jimmy Ba. OpenWebMath: An open dataset of high-quality mathematical web text. arXiv preprint arXiv:2310.06786, 2023

  33. [33]

    The FineWeb datasets: Decanting the web for the finest text data at scale

    Guilherme Penedo, Hynek Kydlíček, Loubna Ben Allal, Anton Lozhkov, Margaret Mitchell, Colin Raffel, Leandro Von Werra, and Thomas Wolf. The FineWeb datasets: Decanting the web for the finest text data at scale. In Advances in Neural Information Processing Systems, volume 37, 2024

  34. [34]

    FineWeb2: One pipeline to scale them all -- adapting pre-training data processing to every language

    Guilherme Penedo, Hynek Kydlíček, Vinko Sabolčec, Bettina Messmer, Negar Foroutan, Amir Hossein Kargaran, Colin Raffel, Martin Jaggi, Leandro Von Werra, and Thomas Wolf. FineWeb2: One pipeline to scale them all -- adapting pre-training data processing to every language. arXiv preprint arXiv:2506.20920, 2025

  35. [35]

    Resolving discrepancies in compute-optimal scaling of language models

    Tomer Porian, Mitchell Wortsman, Jenia Jitsev, Ludwig Schmidt, and Yair Carmon. Resolving discrepancies in compute-optimal scaling of language models. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. URL https://openreview.net/forum?id=4fSSqpk1sM

  36. [36]

    D-CPT law: Domain-specific continual pre-training scaling law for large language models

    Haoran Que, Jiaheng Liu, Ge Zhang, Chenchen Zhang, Xingwei Qu, Yinghao Ma, Feiyu Duan, Zhiqi Bai, Jiakai Wang, Yuanxing Zhang, Xu Tan, Jie Fu, Jiamang Wang, Lin Qu, Wenbo Su, and Bo Zheng. D-CPT law: Domain-specific continual pre-training scaling law for large language models. In The Thirty-eighth Annual Conference on Neural Information Processing Systems,...

  37. [37]

    Language models are unsupervised multitask learners

    Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. OpenAI blog, 2019

  38. [38]

    Exploring the limits of transfer learning with a unified text-to-text transformer

    Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1--67, 2020

  39. [39]

    Winogrande: An adversarial winograd schema challenge at scale

    Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. Winogrande: An adversarial winograd schema challenge at scale. Proceedings of AAAI, 2020

  40. [40]

    Training bilingual lms with data constraints in the targeted language

    Skyler Seto, Maartje Ter Hoeve, Richard He Bai, Natalie Schluter, and David Grangier. Training bilingual lms with data constraints in the targeted language. In Findings of the Association for Computational Linguistics: ACL 2025, pages 19096--19122, 2025

  41. [41]

    Optimal splitting of language models from mixtures to specialized domains

    Skyler Seto, Pierre Ablin, Anastasiia Filippova, Jiayuan Ye, Louis Bethune, Angelos Katharopoulos, and David Grangier. Optimal splitting of language models from mixtures to specialized domains. arXiv preprint arXiv:2603.19149, 2026

  42. [42]

    Scaling laws for optimal data mixtures

    Mustafa Shukor, Louis Bethune, Dan Busbridge, David Grangier, Enrico Fini, Alaaeldin El-Nouby, and Pierre Ablin. Scaling laws for optimal data mixtures. In NeurIPS, 2025. URL https://arxiv.org/abs/2507.09404

  43. [43]

    OpenAI GPT-5 System Card

    Aaditya Singh, Adam Fry, Adam Perelman, Adam Tart, Adi Ganesh, Ahmed El-Kishky, Aidan McLaughlin, Aiden Low, AJ Ostrow, Akhila Ananthram, et al. Openai gpt-5 system card. arXiv preprint arXiv:2601.03267, 2026

  44. [44]

    SlimPajama: A 627B token cleaned and deduplicated version of RedPajama

    Daria Soboleva, Faisal Al-Khateeb, Robert Myers, Jacob R Steeves, Joel Hestness, and Nolan Dey. SlimPajama: A 627B token cleaned and deduplicated version of RedPajama. https://cerebras.ai/blog/slimpajama-a-627b-token-cleaned-and-deduplicated-version-of-redpajama, 2023. URL https://huggingface.co/datasets/cerebras/SlimPajama-627B

  45. [45]

    peS2o (pretraining efficiently on S2ORC) dataset

    Luca Soldaini and Kyle Lo. peS2o (pretraining efficiently on S2ORC) dataset. Technical report, Allen Institute for AI, 2023

  46. [46]

    doi: 10.18653/v1/2024.acl-long.840

    Luca Soldaini, Rodney Kinney, Akshita Bhagia, Dustin Schwenk, David Atkinson, Russell Authur, Ben Bogin, Khyathi Chandu, Jennifer Dumas, Yanai Elazar, Valentin Hofmann, Ananya Jha, Sachin Kumar, Li Lucy, Xinxi Lyu, Nathan Lambert, Ian Magnusson, Jacob Morrison, Niklas Muennighoff, Aakanksha Naik, Crystal Nam, Matthew Peters, Abhilasha Ravichander, Kyle Ri...

  47. [47]

    Nemotron-CC: Transforming Common Crawl into a refined long-horizon pretraining dataset

    Dan Su, Kezhi Kong, Ying Lin, Joseph Jennings, Brandon Norick, Markus Kliegl, Mostofa Patwary, Mohammad Shoeybi, and Bryan Catanzaro. Nemotron-CC: Transforming Common Crawl into a refined long-horizon pretraining dataset. In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar, editors, Proceedings of the 63rd Annual Meeting of ...

  48. [48]

    Scaling laws across model architectures: A comparative analysis of dense and MoE models in large language models

    Siqi Wang, Zhengyu Chen, Bei Li, Keqing He, Min Zhang, and Jingang Wang. Scaling laws across model architectures: A comparative analysis of dense and MoE models in large language models. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen, editors, Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 5583--5595,...

  49. [49]

    Emergent abilities of large language models

    Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, Ed H. Chi, Tatsunori Hashimoto, Oriol Vinyals, Percy Liang, Jeff Dean, and William Fedus. Emergent abilities of large language models. Transactions on Machine Learning Research, 2022. ISSN 2835-8856. URL https://open...

  50. [50]

    Crowdsourcing multiple choice science questions

    Johannes Welbl, Nelson F. Liu, and Matt Gardner. Crowdsourcing multiple choice science questions. In Leon Derczynski, Wei Xu, Alan Ritter, and Tim Baldwin, editors, Proceedings of the 3rd Workshop on Noisy User-generated Text, pages 94--106, Copenhagen, Denmark, September 2017. Association for Computational Linguistics. doi:10.18653/v1/W17-4413. URL https...

  51. [51]

    Accelerating scientific research with gemini: Case studies and common techniques

    David P. Woodruff, Vincent Cohen-Addad, Lalit Jain, Jieming Mao, Song Zuo, MohammadHossein Bateni, Simina Branzei, Michael P. Brenner, Lin Chen, Ying Feng, Lance Fortnow, Gang Fu, Ziyi Guan, Zahra Hadizadeh, Mohammad T. Hajiaghayi, Mahdi JafariRaviz, Adel Javanmard, Karthik C. S., Ken ichi Kawarabayashi, Ravi Kumar, Silvio Lattanzi, Euiwoong Lee, Yi Li, I...

  52. [52]

    DoReMi: Optimizing data mixtures speeds up language model pretraining

    Sang Michael Xie, Hieu Pham, Xuanyi Dong, Nan Du, Hanxiao Liu, Yifeng Lu, Percy Liang, Quoc V Le, Tengyu Ma, and Adams Wei Yu. DoReMi: Optimizing data mixtures speeds up language model pretraining. Advances in Neural Information Processing Systems, 36, 2023

  53. [53]

    Data mixing laws: Optimizing data mixtures by predicting language modeling performance

    Jiasheng Ye, Peiju Liu, Tianxiang Sun, Jun Zhan, Yunhua Zhou, and Xipeng Qiu. Data mixing laws: Optimizing data mixtures by predicting language modeling performance. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=jjCB27TMK3

  54. [54]

    Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. HellaSwag: Can a machine really finish your sentence? In Anna Korhonen, David Traum, and Lluís Màrquez, editors, Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4791--4800, Florence, Italy, July 2019. Association for Computatio...

  55. [55]

    When scaling meets LLM finetuning: The effect of data, model and finetuning method

    Biao Zhang, Zhongtao Liu, Colin Cherry, and Orhan Firat. When scaling meets LLM finetuning: The effect of data, model and finetuning method. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=5HCnKDeTws