Scaling Laws for Mixture Pretraining Under Data Constraints

Anastasiia Sedova; Natalie Schluter; Pierre Ablin; Skyler Seto

arxiv: 2605.12715 · v2 · pith:A55W7NL4new · submitted 2026-05-12 · 💻 cs.LG · cs.CL

Pith reviewed 2026-05-19 16:32 UTC · model grok-4.3

keywords: scaling laws, mixture pretraining, data repetition, language models, data constraints, target domain performance, pretraining optimization, generic data regularization

The pith

When mixed with generic data, scarce target data can be repeated up to 15-20 times, far more than single-source training tolerates, and a new repetition-aware scaling law models how the value of those repeats decays.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Language models require growing amounts of data as they scale, but many valuable target sources such as low-resource languages or specialized domains remain limited in size. The paper investigates the trade-off of mixing this scarce target data with abundant generic data and shows that repetition of the target examples becomes the dominant factor shaping final performance on the target domain. Mixtures tolerate substantially higher repetition rates than training on the target data alone because the generic data supplies regularization that slows overfitting. The authors derive a scaling law that explicitly models the diminishing returns on each additional repetition of target tokens and the stabilizing effect of the generic portion. Optimizing this law yields concrete recommendations for how much target data to include and how many times to reuse it given a fixed compute budget and model size.

Core claim

The central claim, supported by more than 2,000 training runs, has three parts: (1) repetition of target tokens is a central driver of target-domain performance in mixture pretraining; (2) mixtures can safely reuse scarce target corpora 15-20 times; and (3) a repetition-aware scaling law, which captures the decreasing marginal value of repeated target tokens together with the regularizing contribution of generic data, can be used to compute effective mixture ratios directly from target data size, compute budget, and model scale.

What carries the argument

The argument rests on the repetition-aware mixture scaling law, which modifies standard scaling relations to include a term for the diminishing value of each repeated target token and an additive regularization benefit from generic data.

If this is right

Mixture ratios can be chosen by solving the scaling law rather than by running many separate training experiments.
Target data repetition can be set higher when generic data is present, reducing the total volume of unique target tokens needed.
Optimal repetition levels rise with model scale and compute budget but fall as the absolute size of the target corpus increases.
The same law applies across multilingual, domain-specific, and quality-filtered target sources.
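As a rough illustration of what "solving the scaling law" could look like in practice, the sketch below scans candidate target fractions and keeps the one with the lowest predicted loss. It is not the authors' implementation. The effective-data term follows the form quoted elsewhere on this page (D_eff = (1-h) D_total + tau D_T), read here with h as the target fraction; the Chinchilla-style loss wrapper, the function names, and every coefficient are placeholders assumed for the example.

```python
import math


def effective_tokens(h, total_tokens, unique_target, tau, r1):
    """D_eff = (1 - h) * D_total + tau * D_T, following the form quoted on this page.

    Assumption: h is the target fraction of the mixture, so the model makes
    r = h * D_total / D_target passes over the target corpus.
    """
    seen_target = h * total_tokens
    r = seen_target / unique_target
    if r <= 1.0:
        d_t = seen_target  # less than one pass: every target token is fresh
    else:
        d_t = unique_target * (1.0 + r1 * (1.0 - math.exp(-(r - 1.0) / r1)))
    return (1.0 - h) * total_tokens + tau * d_t


def predicted_loss(h, n_params, total_tokens, unique_target, coef):
    """Assumed Chinchilla-style wrapper: E + A / N^alpha + B / D_eff^beta."""
    d_eff = effective_tokens(h, total_tokens, unique_target, coef["tau"], coef["r1"])
    return coef["E"] + coef["A"] / n_params ** coef["alpha"] + coef["B"] / d_eff ** coef["beta"]


def best_fraction(n_params, total_tokens, unique_target, coef, steps=1000):
    """Grid search over target fractions in [0, 1]."""
    grid = [i / steps for i in range(steps + 1)]
    return min(grid, key=lambda h: predicted_loss(h, n_params, total_tokens, unique_target, coef))


# Placeholder coefficients, NOT fitted values from the paper.
COEF = {"E": 1.7, "A": 400.0, "alpha": 0.34, "B": 410.0, "beta": 0.28, "tau": 4.0, "r1": 10.0}

h_star = best_fraction(n_params=1e9, total_tokens=100e9, unique_target=1e9, coef=COEF)
print(f"target fraction: {h_star:.3f}, passes over target data: {h_star * 100e9 / 1e9:.1f}")
```

With these placeholder coefficients the scan settles at roughly 15 passes over the target corpus. That number is a consequence of the values chosen for tau and r1 here, not a result from the paper.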

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The framework could be adapted to decide how much synthetic target data to generate when real data is exhausted.
Similar repetition-aware laws might apply to continued pretraining or instruction-tuning stages where domain data is also limited.
Hardware-aware versions of the law could incorporate per-token latency differences between target and generic batches.

Load-bearing premise

The repetition tolerance and scaling behavior observed across the tested model sizes and data types will continue to hold when the same approach is applied at larger scales or with different data distributions.

What would settle it

A controlled experiment at substantially larger model scale or with a new data type in which the measured optimal repetition count for a given target size and compute budget deviates by more than 30 percent from the value predicted by the scaling law.
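The 30 percent criterion above is a relative-deviation test, which can be written down in a few lines. The function names and the example numbers are hypothetical; only the threshold comes from the text.

```python
def relative_deviation(measured, predicted):
    """Deviation of the measured optimum from the predicted one, relative to the prediction."""
    if predicted == 0:
        raise ValueError("predicted repetition count must be non-zero")
    return abs(measured - predicted) / abs(predicted)


def law_is_contradicted(measured, predicted, tolerance=0.30):
    """True when the measured optimal repetition count misses the prediction by more than 30%."""
    return relative_deviation(measured, predicted) > tolerance


# Hypothetical outcomes for a predicted optimum of 18 repetitions.
print(law_is_contradicted(measured=20, predicted=18))  # within tolerance
print(law_is_contradicted(measured=25, predicted=18))  # outside tolerance
```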

Original abstract

As language models scale, the amount of data they require grows -- yet many target data sources, such as low-resource languages or specialized domains, are inherently limited in size. A common strategy is to mix this scarce but valuable target data with abundant generic data, which presents a fundamental trade-off: too little target data in the mixture underexposes the model to the target domain, while too much target data repeats the same examples excessively, yielding diminishing returns and eventual overfitting. We study this trade-off across more than 2,000 language-model training runs spanning multiple model and target dataset sizes, as well as several data types, including multilingual, domain-specific, and quality-filtered mixtures. Across all settings, we find that repetition is a central driver of target-domain performance, and that mixture training tolerates much higher repetition than single-source training: scarce target corpora can be reused 15-20 times, with the optimal number of repetitions depending on the target data size, compute budget, and model scale. Next, we introduce a repetition-aware mixture scaling law that accounts for the decreasing value of repeated target tokens and the regularizing role of generic data. Optimizing the scaling law provides a principled way to compute effective mixture configurations, yielding practical mixture recommendations for pretraining under data constraints.
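The trade-off in the abstract comes down to simple arithmetic: for a fixed token budget, the share of the mixture given to the target source determines how many times the target corpus is seen. A minimal sketch, assuming the mixture fraction is measured in tokens; the function names and the budget figures are illustrative, not taken from the paper.

```python
def target_repetitions(target_fraction, total_tokens, unique_target_tokens):
    """Number of passes over the target corpus implied by a mixture fraction."""
    if not 0.0 <= target_fraction <= 1.0:
        raise ValueError("target_fraction must be in [0, 1]")
    return target_fraction * total_tokens / unique_target_tokens


def fraction_for_repetitions(repetitions, total_tokens, unique_target_tokens):
    """Mixture fraction that yields a given number of passes, capped at 1."""
    return min(1.0, repetitions * unique_target_tokens / total_tokens)


# Hypothetical budget: 100B training tokens, 1B unique target tokens.
total, unique = 100e9, 1e9
for frac in (0.01, 0.05, 0.20, 0.50):
    passes = target_repetitions(frac, total, unique)
    print(f"target fraction {frac:.2f} -> {passes:.0f} passes over the target data")
```

Read the other way, a goal of 15 to 20 passes in this hypothetical setting corresponds to giving the target source 15 to 20 percent of the training tokens.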

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

Mixture training lets you repeat scarce target data 15-20 times with less harm than single-source training, backed by a large experiment set, though the scaling law looks fitted after the fact.

The main thing to know is that this paper measures how much repetition of limited target data you can get away with when mixing it with generic data during pretraining. They find the sweet spot often lands around 15-20 repeats, varying with target size, compute, and model scale, and that mixtures handle repeats better than training only on the target set. Over 2,000 runs across model sizes and data types like multilingual or domain-specific give the repetition claim some weight.

They also introduce a scaling law meant to capture falling value from repeated target tokens plus regularization from the generic data, then optimize it for mixture recommendations. That combination of scale and a concrete guideline is the useful part. The experiments appear consistent across the tested regimes, which is better than many scaling-law papers that rely on fewer points.

The soft spot is the scaling law. It comes after the runs are described, so the functional form and parameters may have been chosen to match the observed performance curves rather than derived first or checked on held-out repetition ratios. Without details on fitting procedure, validation splits, or error bars, it is hard to tell how well it would predict for new data sizes or model scales. The abstract does not address that directly.

This work is aimed at people who train models on constrained data, such as low-resource languages or specialized domains. A practitioner who needs numbers to decide how much to upsample or repeat target tokens under a fixed compute budget would find the repetition tolerance and mixture advice directly usable. The experiment volume and the practical question make it worth a serious referee, even if the scaling law needs tighter validation. I would send it out for peer review and ask reviewers to check the law's independence from the data it explains.

Referee Report

2 major / 2 minor

Summary. The paper examines the trade-off in pretraining language models when mixing scarce target-domain data (e.g., low-resource languages or specialized domains) with abundant generic data. Across more than 2,000 training runs spanning model sizes, target dataset sizes, and data types (multilingual, domain-specific, quality-filtered), it reports that repetition of target tokens is a central driver of performance and that mixture training tolerates substantially higher repetition (optimal 15-20x) than single-source training, with the optimum depending on target size, compute budget, and model scale. It then introduces a repetition-aware mixture scaling law that models the decreasing value of repeated target tokens together with the regularizing effect of generic data; optimizing this law is claimed to yield principled mixture recommendations under data constraints.

Significance. If the scaling law is shown to be predictive rather than a post-hoc fit, the work would offer both empirical guidance and a practical tool for pretraining under realistic data scarcity, a common constraint in multilingual and domain-specific settings. The scale of the experimental campaign (>2000 runs) provides a solid empirical foundation for the repetition-tolerance claims and the dependence on target size, compute, and model scale.

major comments (2)

[Scaling law section (post-experiments)] The repetition-aware mixture scaling law is introduced after the experimental results are presented. It is unclear whether its functional form and coefficients were derived independently (e.g., from first principles or a separate theoretical argument) or selected/tuned to match the measured performance curves from the same >2000 runs. If the latter, the claim that the law provides a 'principled way to compute effective mixture configurations' reduces to a descriptive fit whose extrapolation to new regimes or held-out repetition ratios remains untested.
[Experimental results and figures] No error bars, confidence intervals, or exclusion criteria are reported for the performance measurements across the 2000+ runs. This makes it difficult to assess the statistical reliability of the reported optimal repetition counts (15-20x) and the dependence on target size, compute, and scale.
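To make the second comment concrete, the sketch below shows one way error bars could be reported for a repetition sweep: per-configuration mean and sample standard deviation across seeds, followed by a crude check of which repetition counts cannot be told apart from the best one. The loss values are invented for illustration, and the overlap rule (difference in means no larger than the sum of the two standard deviations) is an assumption of this example, not a procedure from the paper.

```python
import statistics


def summarize(losses_by_repetition):
    """Mean and sample standard deviation per repetition count (needs >= 2 seeds each)."""
    return {
        reps: (statistics.mean(losses), statistics.stdev(losses))
        for reps, losses in sorted(losses_by_repetition.items())
    }


def within_noise_of_best(summary):
    """Repetition counts whose mean loss is within the combined spread of the best one."""
    best = min(summary, key=lambda reps: summary[reps][0])
    best_mean, best_std = summary[best]
    return [
        reps
        for reps, (mean, std) in sorted(summary.items())
        if mean - best_mean <= best_std + std
    ]


# Invented target-domain losses for three seeds per repetition count.
runs = {
    10: [2.412, 2.405, 2.420],
    15: [2.380, 2.391, 2.377],
    20: [2.383, 2.379, 2.395],
}

summary = summarize(runs)
for reps, (mean, std) in summary.items():
    print(f"{reps:>2} repetitions: loss {mean:.4f} +/- {std:.4f}")
print("indistinguishable from the best:", within_noise_of_best(summary))
```

In this made-up sweep, 15 and 20 repetitions overlap while 10 does not, which is the kind of statement that seed-level variability makes possible.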

minor comments (2)

[Abstract and introduction] The abstract and main text would benefit from an explicit statement of how the scaling-law parameters were obtained (e.g., least-squares fit on which subset of runs, or closed-form derivation).
[Scaling law definition] Notation for the scaling law (e.g., symbols for repetition factor, effective tokens, regularization term) should be introduced with a clear table or equation reference to avoid ambiguity when the law is later optimized.

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for the constructive feedback on our manuscript. We address each major comment below and describe the revisions we plan to make.

Referee: [Scaling law section (post-experiments)] The repetition-aware mixture scaling law is introduced after the experimental results are presented. It is unclear whether its functional form and coefficients were derived independently (e.g., from first principles or a separate theoretical argument) or selected/tuned to match the measured performance curves from the same >2000 runs. If the latter, the claim that the law provides a 'principled way to compute effective mixture configurations' reduces to a descriptive fit whose extrapolation to new regimes or held-out repetition ratios remains untested.

Authors: The functional form is motivated by established scaling-law structures for diminishing returns under data repetition (building on prior work such as Hoffmann et al.) together with an explicit term for the regularizing contribution of generic data. Coefficients were obtained by fitting to the full set of runs. We acknowledge that this renders the law primarily empirical rather than derived from first principles. In the revision we will clarify this distinction in the text, add a dedicated subsection on model derivation, and report explicit held-out validation: we will reserve a subset of repetition ratios and target-size/compute combinations, refit on the remainder, and demonstrate that the law still predicts optimal mixtures with low error on the held-out points. revision: yes
Referee: [Experimental results and figures] No error bars, confidence intervals, or exclusion criteria are reported for the performance measurements across the 2000+ runs. This makes it difficult to assess the statistical reliability of the reported optimal repetition counts (15-20x) and the dependence on target size, compute, and scale.

Authors: We agree that error bars and explicit exclusion criteria would strengthen statistical assessment. Because of the computational cost of more than 2,000 full training runs, we did not repeat every configuration with independent random seeds. Nevertheless, the reported optimal repetition range (15-20x) and its dependence on target size, compute, and scale emerge consistently across multiple data regimes (multilingual, domain-specific, quality-filtered). In the revised manuscript we will (i) state the exclusion criteria used for outlier runs, (ii) add error bars to the primary figures based on repeated-seed subsets for representative model and data sizes, and (iii) include a short variability analysis quantifying the standard deviation observed across those repeats. revision: yes
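The held-out validation promised in the first response can be pictured with a deliberately simplified example. The sketch fits only the single parameter r1 of the saturation term rho(r) = r1 (1 - exp(-(r - 1) / r1)) quoted elsewhere on this page, using a brute-force least-squares scan, then scores the fit on repetition counts it never saw. The data are synthetic, and refitting the full law on real runs, as the authors describe, would involve more parameters and a proper optimizer.

```python
import math


def rho(r, r1):
    """Saturating value of repeats: rho(r) = r1 * (1 - exp(-(r - 1) / r1))."""
    return r1 * (1.0 - math.exp(-(r - 1.0) / r1))


def fit_r1(points, candidates):
    """Least-squares fit of r1 by scanning a list of candidate values."""
    def squared_error(r1):
        return sum((rho(r, r1) - y) ** 2 for r, y in points)
    return min(candidates, key=squared_error)


def mean_relative_error(points, r1):
    """Average relative gap between the fitted curve and the observations."""
    return sum(abs(rho(r, r1) - y) / abs(y) for r, y in points) / len(points)


# Synthetic observations: generated from r1 = 12 with an alternating +/-1% perturbation.
TRUE_R1 = 12.0
repeats = [2, 4, 6, 8, 10, 15, 20, 30]
data = [(r, rho(r, TRUE_R1) * (1.01 if i % 2 == 0 else 0.99)) for i, r in enumerate(repeats)]

# Reserve two repetition counts, fit on the rest.
held_out_repeats = {15, 30}
train = [p for p in data if p[0] not in held_out_repeats]
held_out = [p for p in data if p[0] in held_out_repeats]

candidates = [0.1 * i for i in range(10, 401)]  # r1 from 1.0 to 40.0
r1_hat = fit_r1(train, candidates)
print(f"fitted r1 = {r1_hat:.1f}")
print(f"held-out mean relative error = {mean_relative_error(held_out, r1_hat):.2%}")
```

A small held-out error here shows only that the procedure works on data drawn from the assumed form. The informative test is the same split applied to the real runs.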

Circularity Check

1 step flagged

Repetition-aware scaling law fitted to experimental runs rather than independently derived from first principles

specific steps

fitted input called prediction [Abstract]
"Next, we introduce a repetition-aware mixture scaling law that accounts for the decreasing value of repeated target tokens and the regularizing role of generic data. Optimizing the scaling law provides a principled way to compute effective mixture configurations, yielding practical mixture recommendations for pretraining under data constraints."

The law is introduced after describing the >2000 runs across model sizes and data types. Its parameters are selected or tuned so that the law reproduces the measured target-domain performance as a function of repetition and mixture ratio; the subsequent 'optimization' therefore outputs recommendations that are statistically forced by the fit rather than predicted from an external derivation.

full rationale

The paper conducts >2000 runs, then introduces a scaling law whose functional form accounts for repetition value and generic-data regularization. Optimizing this law is presented as yielding principled mixture recommendations. Because the law is introduced after the runs and its parameters are chosen to match the observed performance curves (as implied by the experimental scale and the claim of 'accounting for' the patterns), the recommendations reduce to a descriptive fit on the same data rather than an independent prediction. This is a moderate instance of fitted-input-called-prediction; the central claim still contains empirical content but the 'principled' status is not independently validated.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The scaling law relies on assumptions about repetition value decay and generic data regularization, with optimal repetition counts likely determined empirically from the runs.

free parameters (1)

optimal repetition count
Observed range of 15-20 repetitions fitted or selected based on target data size, compute, and model scale in experiments.

axioms (1)

domain assumption: Generic data provides regularization that mitigates overfitting from repeated target data.
Invoked to explain why mixtures tolerate higher repetition than single-source training.

pith-pipeline@v0.9.0 · 5760 in / 1363 out tokens · 53261 ms · 2026-05-19T16:32:10.464199+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

We introduce a repetition-aware mixture scaling law that accounts for the decreasing value of repeated target tokens... $D_{\mathrm{eff}} = (1-h)\,D_{\mathrm{total}} + \tau D_T$, $D_T = D_{\mathrm{target}}\,(1 + \rho(r))$, with $\rho(r) = r_1\,(1 - e^{-(r-1)/r_1})$
IndisputableMonolith/Foundation/DimensionForcing.lean reality_from_one_distinction unclear

optimal repetition... reaching up to 15–20... mixture training tolerates much higher repetition than single-source training
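The effective-data expression in the first quoted passage admits a short worked calculation. The derivation below rests on assumptions that the excerpt does not confirm: that $h$ is the fraction of training tokens drawn from the target source, that the repetition count is $r = h\,D_{\mathrm{total}}/D_{\mathrm{target}}$ with $r \ge 1$, that $\tau > 1$ and $r_1$ are constants, and that the best mixture is the one maximizing $D_{\mathrm{eff}}$ at fixed $D_{\mathrm{total}}$.

```latex
\begin{align*}
D_{\mathrm{eff}}(h)
  &= (1-h)\,D_{\mathrm{total}}
   + \tau\,D_{\mathrm{target}}\Bigl(1 + r_1\bigl(1 - e^{-(r-1)/r_1}\bigr)\Bigr),
  \qquad r = \frac{h\,D_{\mathrm{total}}}{D_{\mathrm{target}}}, \\[4pt]
\frac{\partial D_{\mathrm{eff}}}{\partial h}
  &= -D_{\mathrm{total}}
   + \tau\,D_{\mathrm{target}}\,e^{-(r-1)/r_1}\,\frac{D_{\mathrm{total}}}{D_{\mathrm{target}}}
   = D_{\mathrm{total}}\bigl(\tau\,e^{-(r-1)/r_1} - 1\bigr), \\[4pt]
\frac{\partial D_{\mathrm{eff}}}{\partial h} = 0
  &\;\Longrightarrow\; r^{\star} = 1 + r_1 \ln \tau,
  \qquad h^{\star} = \frac{r^{\star}\,D_{\mathrm{target}}}{D_{\mathrm{total}}}.
\end{align*}
```

Under this reading the optimal repetition count depends only on $\tau$ and $r_1$. The paper reports that the optimum also varies with target size, compute budget, and model scale, so in the full law those quantities presumably enter through the fitted parameters or through terms not shown in the excerpt.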

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

55 extracted references · 55 canonical work pages · 9 internal anchors

[1]

Parameters vs flops: Scaling laws for optimal sparsity for mixture-of-experts language models

Samira Abnar, Harshay Shah, Dan Busbridge, Alaaeldin El-Nouby, Joshua M Susskind, and Vimal Thilak. Parameters vs flops: Scaling laws for optimal sparsity for mixture-of-experts language models. In International Conference on Machine Learning, pages 204--230. PMLR, 2025

work page 2025
[2]

Mix, don't tune: Bilingual pre-training outperforms hyperparameter search in data-constrained settings

Anonymous. Mix, don't tune: Bilingual pre-training outperforms hyperparameter search in data-constrained settings. Submitted to NeurIPS 2026, 2026

work page 2026
[3]

Introducing claude opus 4.6

Anthropic. Introducing claude opus 4.6. Anthropic Announcements, 2026. URL https://www.anthropic.com/news/claude-opus-4-6

work page 2026
[4]

SmolLM3: smol, multilingual, long-context reasoner

Elie Bakouch, Loubna Ben Allal, Anton Lozhkov, Nouamane Tazi, Lewis Tunstall, Carlos Miguel Patiño, Edward Beeching, Aymeric Roucher, Aksel Joonas Reedi, Quentin Gallouédec, Kashif Rasul, Nathan Habib, Clémentine Fourrier, Hynek Kydlicek, Guilherme Penedo, Hugo Larcher, Mathieu Morlon, Vaibhav Srivastav, Joshua Lochner, Xuan-Son Nguyen, Colin Raffel, Lean...

work page 2025
[5]

Scaling laws for forgetting during finetuning with pretraining data injection

Louis Béthune, David Grangier, Dan Busbridge, Eleonora Gualdoni, Marco Cuturi, and Pierre Ablin. Scaling laws for forgetting during finetuning with pretraining data injection. In Forty-second International Conference on Machine Learning, 2025. URL https://openreview.net/forum?id=vWMij23BmQ

work page 2025
[6]

Piqa: Reasoning about physical commonsense in natural language

Yonatan Bisk, Rowan Zellers, Ronan Le Bras, Jianfeng Gao, and Yejin Choi. Piqa: Reasoning about physical commonsense in natural language. In Thirty-Fourth AAAI Conference on Artificial Intelligence, 2020

work page 2020
[7]

Scaling parameter-constrained language models with quality data

Ernie Chang, Matteo Paltenghi, Yang Li, Pin-Jie Lin, Changsheng Zhao, Patrick Huber, Zechun Liu, Rastislav Rabatin, Yangyang Shi, and Vikas Chandra. Scaling parameter-constrained language models with quality data. In Franck Dernoncourt, Daniel Preoţiuc-Pietro, and Anastasia Shimorina, editors, Proceedings of the 2024 Conference on Empirical Methods in N...

work page doi:10.18653/v1/2024.emnlp-industry.8 2024
[8]

Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge

Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? try arc, the ai2 reasoning challenge. arXiv preprint arXiv:1803.05457, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[9]

Nemotron-climb: Clustering-based iterative data mixture bootstrapping for language model pre-training

Shizhe Diao, Yu Yang, Yonggan Fu, Xin Dong, Dan Su, Markus Kliegl, Zijia Chen, Peter Belcak, Yoshi Suhara, Hongxu Yin, Mostofa Patwary, Yingyan Lin, Jan Kautz, and Pavlo Molchanov. Nemotron-climb: Clustering-based iterative data mixture bootstrapping for language model pre-training, 2025. URL https://arxiv.org/abs/2504.13161

work page arXiv 2025
[10]

Essential-web v1.0: 24t tokens of organized web data

Essential AI, Andrew Hojel, Michael Pust, Tim Romanski, Yash Vanjani, Ritvik Kapila, Mohit Parmar, Adarsh Chaluvaraju, Alok Tripathy, Anil Thomas, Ashish Tanwer, Darsh J Shah, Ishaan Shah, Karl Stratos, Khoi Nguyen, Kurt Smith, Michael Callahan, Peter Rushton, Philip Monk, Platon Mazarakis, Saad Jamal, Saurabh Srivastava, Somanshu Singla, and Ashish V...

work page arXiv 2025
[11]

Doge: domain reweighting with generalization estimation

Simin Fan, Matteo Pagliardini, and Martin Jaggi. Doge: domain reweighting with generalization estimation. In Proceedings of the 41st International Conference on Machine Learning, ICML'24. JMLR.org, 2024

work page 2024
[12]

The Pile: An 800GB Dataset of Diverse Text for Language Modeling

Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, Shawn Presser, and Connor Leahy. The Pile: An 800gb dataset of diverse text for language modeling. arXiv preprint arXiv:2101.00027, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2020
[13]

Scaling laws for data filtering -- data curation cannot be compute agnostic

Sachin Goyal, Pratyush Maini, Zachary C. Lipton, Aditi Raghunathan, and J. Zico Kolter. Scaling laws for data filtering—data curation cannot be compute agnostic. In 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 22702--22711, 2024. doi:10.1109/CVPR52733.2024.02142

work page doi:10.1109/cvpr52733.2024.02142 2024
[14]

Task-adaptive pretrained language models via clustered-importance sampling

David Grangier, Simin Fan, Skyler Seto, and Pierre Ablin. Task-adaptive pretrained language models via clustered-importance sampling. In ICLR, 2025

work page 2025
[15]

Textbooks are all you need

Suriya Gunasekar, Yi Zhang, Jyoti Aneja, Caio Cesar, Teodoro Mendes, Allie Del Giorno, Sivakanth Gopi, Mojan Javaheripi, Piero Kauffmann, Gustavo de Rosa, Olli Saarikivi, Adil Salim, Shital Shah, Harkirat Singh Behl, Xin Wang, Sébastien Bubeck, Ronen Eldan, Adam Tauman Kalai, Yin Tat Lee, and Yuanzhi Li. Textbooks are all you need, June 2023. URL https://...

work page 2023
[16]

Scaling laws and compute-optimal training beyond fixed training durations

Alexander Hägele, Elie Bakouch, Atli Kosson, Loubna Ben allal, Leandro Von Werra, and Martin Jaggi. Scaling laws and compute-optimal training beyond fixed training durations. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. URL https://openreview.net/forum?id=Y13gSfTjGr

work page 2024
[17]

Scaling Laws and Interpretability of Learning from Repeated Data

Danny Hernandez, Tom Brown, Tom Conerly, Nova DasSarma, Dawn Drain, Sheer El-Showk, Nelson Elhage, Zac Hatfield-Dodds, Tom Henighan, Tristan Hume, Scott Johnston, Ben Mann, Chris Olah, Catherine Olsson, Dario Amodei, Nicholas Joseph, Jared Kaplan, and Sam McCandlish. Scaling laws and interpretability of learning from repeated data, 2022. URL https://arxiv...

work page internal anchor Pith review Pith/arXiv arXiv 2022
[18]

Training compute-optimal large language models

Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, Tom Hennigan, Eric Noland, Katie Millican, George van den Driessche, Bogdan Damoc, Aurelia Guy, Simon Osindero, Karen Simonyan, Erich Elsen, Oriol Vinyals, Jack W. Rae, and Laurent Sifre...

work page 2022
[19]

The State and Fate of Linguistic Diversity and Inclusion in the NLP World

Pratik Joshi, Sebastin Santy, Amar Budhiraja, Kalika Bali, and Monojit Choudhury. The state and fate of linguistic diversity and inclusion in the NLP world. In Dan Jurafsky, Joyce Chai, Natalie Schluter, and Joel Tetreault, editors, Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 6282--6293, Online, July 2020...

work page doi:10.18653/v1/2020.acl-main.560 2020
[20]

Scaling Laws for Neural Language Models

Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2020
[21]

Adam: A Method for Stochastic Optimization

Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014

work page internal anchor Pith review Pith/arXiv arXiv 2014
[22]

Scaling laws for fine-grained mixture of experts

Jakub Krajewski, Jan Ludziejewski, Kamil Adamczewski, Maciej Pióro, Michał Krutul, Szymon Antoniak, Kamil Ciebiera, Krystian Król, Tomasz Odrzygóźdź, Piotr Sankowski, Marek Cygan, and Sebastian Jaszczur. Scaling laws for fine-grained mixture of experts, 2024. URL https://arxiv.org/abs/2402.07871

work page arXiv 2024
[23]

Sentencepiece: A simple and language independent subword tokenizer and detokenizer for neural text processing

Taku Kudo and John Richardson. Sentencepiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 66--71, 2018

work page 2018
[24]

DataComp-LM: In search of the next generation of training sets for language models

Jeffrey Li, Alex Fang, Georgios Smyrnis, Maor Ivgi, Matt Jordan, Samir Gadre, Hritik Bansal, Etash Guha, Sedrick Keh, Kushal Arora, Saurabh Garg, Rui Xin, Niklas Muennighoff, Reinhard Heckel, Jean Mercat, Mayee Chen, Suchin Gururangan, Mitchell Wortsman, Alon Albalak, Yonatan Bitton, Marianna Nezhurina, Amro Abbas, Cheng-Yu Hsieh, Dhruba Ghosh, Josh Gardn...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[25]

From acceleration to saturation: Scaling behavior of bootstrapped language model pretraining

Seng Pei Liew and Takuya Kato. From acceleration to saturation: Scaling behavior of bootstrapped language model pretraining. In NeurIPS 2025 Workshop on Evaluating the Evolving LLM Lifecycle: Benchmarks, Emergent Abilities, and Scaling, 2025. URL https://openreview.net/forum?id=PhsneSYvWK

work page 2025
[26]

Advanced version of gemini with deep think officially achieves gold-medal standard at the international mathematical olympiad

Thang Luong and Edward Lockhart. Advanced version of gemini with deep think officially achieves gold-medal standard at the international mathematical olympiad. Google DeepMind Blog, 1, 2025

work page 2025
[27]

Rephrasing the web: A recipe for compute and data-efficient language modeling

Pratyush Maini, Skyler Seto, Richard Bai, David Grangier, Yizhe Zhang, and Navdeep Jaitly. Rephrasing the web: A recipe for compute and data-efficient language modeling. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 14044--14072, 2024

work page 2024
[28]

Scaling data-constrained language models

Niklas Muennighoff, Alexander M Rush, Boaz Barak, Teven Le Scao, Nouamane Tazi, Aleksandra Piktus, Sampo Pyysalo, Thomas Wolf, and Colin Raffel. Scaling data-constrained language models. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. URL https://openreview.net/forum?id=j5BuTrEj35

work page 2023
[29]

Team OLMo, Allyson Ettinger, Amanda Bertsch, Bailey Kuehl, David Graham, David Heineman, Dirk Groeneveld, Faeze Brahman, Finbarr Timbers, Hamish Ivison, Jacob Morrison, Jake Poznanski, Kyle Lo, Luca Soldaini, Matt Jordan, Mayee Chen, Michael Noukhovitch, Nathan Lambert, Pete Walsh, Pradeep Dasigi, Robert Berry, Saumya Malik, Saurabh Shah, Scott Geng, S...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[30]

Team OLMo, Pete Walsh, Luca Soldaini, Dirk Groeneveld, Kyle Lo, Shane Arora, Akshita Bhagia, Yuling Gu, Shengyi Huang, Matt Jordan, Nathan Lambert, Dustin Schwenk, Oyvind Tafjord, Taira Anderson, David Atkinson, Faeze Brahman, Christopher Clark, Pradeep Dasigi, Nouha Dziri, Allyson Ettinger, Michal Guerquin, David Heineman, Hamish Ivison, Pang Wei Koh, Ji...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[31]

The LAMBADA dataset: Word prediction requiring a broad discourse context

Denis Paperno, Germán Kruszewski, Angeliki Lazaridou, Ngoc Quan Pham, Raffaella Bernardi, Sandro Pezzelle, Marco Baroni, Gemma Boleda, and Raquel Fernández. The LAMBADA dataset: Word prediction requiring a broad discourse context. In Katrin Erk and Noah A. Smith, editors, Proceedings of the 54th Annual Meeting of the Association for Computational ...

work page doi:10.18653/v1/p16-1144 2016
[32]

OpenWebMath: An open dataset of high-quality mathematical web text

Keiran Paster, Marco Dos Santos, Zhangir Azerbayev, and Jimmy Ba. OpenWebMath: An open dataset of high-quality mathematical web text. arXiv preprint arXiv:2310.06786, 2023

work page arXiv 2023
[33]

The FineWeb datasets: Decanting the web for the finest text data at scale

Guilherme Penedo, Hynek Kydlíček, Loubna Ben allal, Anton Lozhkov, Margaret Mitchell, Colin Raffel, Leandro Von Werra, and Thomas Wolf. The FineWeb datasets: Decanting the web for the finest text data at scale. In Advances in Neural Information Processing Systems, volume 37, 2024

work page 2024
[34]

Fineweb2: One pipeline to scale them all -- adapting pre-training data processing to every language

Guilherme Penedo, Hynek Kydlíček, Vinko Sabolčec, Bettina Messmer, Negar Foroutan, Amir Hossein Kargaran, Colin Raffel, Martin Jaggi, Leandro Von Werra, and Thomas Wolf. FineWeb2: One pipeline to scale them all -- adapting pre-training data processing to every language. arXiv preprint arXiv:2506.20920, 2025

work page arXiv 2025
[35]

Resolving discrepancies in compute-optimal scaling of language models

Tomer Porian, Mitchell Wortsman, Jenia Jitsev, Ludwig Schmidt, and Yair Carmon. Resolving discrepancies in compute-optimal scaling of language models. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. URL https://openreview.net/forum?id=4fSSqpk1sM

work page 2024
[36]

D-CPT law: Domain-specific continual pre-training scaling law for large language models

Haoran Que, Jiaheng Liu, Ge Zhang, Chenchen Zhang, Xingwei Qu, Yinghao Ma, Feiyu Duan, Zhiqi Bai, Jiakai Wang, Yuanxing Zhang, Xu Tan, Jie Fu, Jiamang Wang, Lin Qu, Wenbo Su, and Bo Zheng. D-CPT law: Domain-specific continual pre-training scaling law for large language models. In The Thirty-eighth Annual Conference on Neural Information Processing Systems,...

work page 2024
[37]

Language models are unsupervised multitask learners

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. OpenAI blog, 2019

work page 2019
[38]

Exploring the limits of transfer learning with a unified text-to-text transformer

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1--67, 2020

[39]

Winogrande: An adversarial winograd schema challenge at scale

Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. Winogrande: An adversarial winograd schema challenge at scale. Proceedings of AAAI, 2020

[40]

Training bilingual lms with data constraints in the targeted language

Skyler Seto, Maartje Ter Hoeve, Richard He Bai, Natalie Schluter, and David Grangier. Training bilingual lms with data constraints in the targeted language. In Findings of the Association for Computational Linguistics: ACL 2025, pages 19096--19122, 2025

[41]

Optimal splitting of language models from mixtures to specialized domains

Skyler Seto, Pierre Ablin, Anastasiia Filippova, Jiayuan Ye, Louis Bethune, Angelos Katharopoulos, and David Grangier. Optimal splitting of language models from mixtures to specialized domains. arXiv preprint arXiv:2603.19149, 2026

[42]

Scaling laws for optimal data mixtures

Mustafa Shukor, Louis Bethune, Dan Busbridge, David Grangier, Enrico Fini, Alaaeldin El-Nouby, and Pierre Ablin. Scaling laws for optimal data mixtures. In NeurIPS, 2025. URL https://arxiv.org/abs/2507.09404

[43]

OpenAI GPT-5 System Card

Aaditya Singh, Adam Fry, Adam Perelman, Adam Tart, Adi Ganesh, Ahmed El-Kishky, Aidan McLaughlin, Aiden Low, AJ Ostrow, Akhila Ananthram, et al. OpenAI GPT-5 system card. arXiv preprint arXiv:2601.03267, 2026

[44]

SlimPajama: A 627B token cleaned and deduplicated version of RedPajama

Daria Soboleva, Faisal Al-Khateeb, Robert Myers, Jacob R Steeves, Joel Hestness, and Nolan Dey. SlimPajama: A 627B token cleaned and deduplicated version of RedPajama. https://cerebras.ai/blog/slimpajama-a-627b-token-cleaned-and-deduplicated-version-of-redpajama, 2023. URL https://huggingface.co/datasets/cerebras/SlimPajama-627B

[45]

peS2o (pretraining efficiently on S2ORC) dataset

Luca Soldaini and Kyle Lo. peS2o (pretraining efficiently on S2ORC) dataset. Technical report, Allen Institute for AI, 2023

[46]

Dolma: An open corpus of three trillion tokens for language model pretraining research (doi: 10.18653/v1/2024.acl-long.840)

Luca Soldaini, Rodney Kinney, Akshita Bhagia, Dustin Schwenk, David Atkinson, Russell Authur, Ben Bogin, Khyathi Chandu, Jennifer Dumas, Yanai Elazar, Valentin Hofmann, Ananya Jha, Sachin Kumar, Li Lucy, Xinxi Lyu, Nathan Lambert, Ian Magnusson, Jacob Morrison, Niklas Muennighoff, Aakanksha Naik, Crystal Nam, Matthew Peters, Abhilasha Ravichander, Kyle Ri...

doi: 10.18653/v1/2024.acl-long.840
[47]

Nemotron-CC: Transforming Common Crawl into a refined long-horizon pretraining dataset

Dan Su, Kezhi Kong, Ying Lin, Joseph Jennings, Brandon Norick, Markus Kliegl, Mostofa Patwary, Mohammad Shoeybi, and Bryan Catanzaro. Nemotron-CC: Transforming Common Crawl into a refined long-horizon pretraining dataset. In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar, editors, Proceedings of the 63rd Annual Meeting of ...

doi: 10.18653/v1/2025.acl-long.123
[48]

Scaling laws across model architectures: A comparative analysis of dense and MoE models in large language models

Siqi Wang, Zhengyu Chen, Bei Li, Keqing He, Min Zhang, and Jingang Wang. Scaling laws across model architectures: A comparative analysis of dense and MoE models in large language models. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen, editors, Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 5583--5595,...

doi: 10.18653/v1/2024.emnlp-main.319
[49]

Emergent abilities of large language models

Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, Ed H. Chi, Tatsunori Hashimoto, Oriol Vinyals, Percy Liang, Jeff Dean, and William Fedus. Emergent abilities of large language models. Transactions on Machine Learning Research, 2022. ISSN 2835-8856. URL https://open...

[50]

Crowdsourcing multiple choice science questions

Johannes Welbl, Nelson F. Liu, and Matt Gardner. Crowdsourcing multiple choice science questions. In Leon Derczynski, Wei Xu, Alan Ritter, and Tim Baldwin, editors, Proceedings of the 3rd Workshop on Noisy User-generated Text, pages 94--106, Copenhagen, Denmark, September 2017. Association for Computational Linguistics. doi:10.18653/v1/W17-4413. URL https...

[51]

Accelerating scientific research with Gemini: Case studies and common techniques

David P. Woodruff, Vincent Cohen-Addad, Lalit Jain, Jieming Mao, Song Zuo, MohammadHossein Bateni, Simina Branzei, Michael P. Brenner, Lin Chen, Ying Feng, Lance Fortnow, Gang Fu, Ziyi Guan, Zahra Hadizadeh, Mohammad T. Hajiaghayi, Mahdi JafariRaviz, Adel Javanmard, Karthik C. S., Ken-ichi Kawarabayashi, Ravi Kumar, Silvio Lattanzi, Euiwoong Lee, Yi Li, I...

arXiv, 2026
[52]

DoReMi: Optimizing data mixtures speeds up language model pretraining

Sang Michael Xie, Hieu Pham, Xuanyi Dong, Nan Du, Hanxiao Liu, Yifeng Lu, Percy Liang, Quoc V Le, Tengyu Ma, and Adams Wei Yu. DoReMi: Optimizing data mixtures speeds up language model pretraining. Advances in Neural Information Processing Systems, 36, 2023

[53]

Data mixing laws: Optimizing data mixtures by predicting language modeling performance

Jiasheng Ye, Peiju Liu, Tianxiang Sun, Jun Zhan, Yunhua Zhou, and Xipeng Qiu. Data mixing laws: Optimizing data mixtures by predicting language modeling performance. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=jjCB27TMK3

[54]

Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. HellaSwag: Can a machine really finish your sentence? In Anna Korhonen, David Traum, and Lluís Màrquez, editors, Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4791--4800, Florence, Italy, July 2019. Association for Computatio...

doi: 10.18653/v1/p19-1472
[55]

When scaling meets LLM finetuning: The effect of data, model and finetuning method

Biao Zhang, Zhongtao Liu, Colin Cherry, and Orhan Firat. When scaling meets LLM finetuning: The effect of data, model and finetuning method. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=5HCnKDeTws

