Scaling Laws for Mixture Pretraining Under Data Constraints
Pith reviewed 2026-05-19 16:32 UTC · model grok-4.3
The pith
When mixed with generic data, scarce target data can be repeated 15-20 times before performance plateaus, far more than single-source training tolerates, according to a new repetition-aware scaling law.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Drawing on more than 2,000 training runs, the paper makes three central claims: repetition of target tokens is a central driver of target-domain performance in mixture pretraining; mixtures can safely reuse scarce target corpora 15-20 times; and a repetition-aware scaling law, which captures both the decreasing marginal value of repeated target tokens and the regularizing contribution of generic data, can be used to compute effective mixture ratios directly from target data size, compute budget, and model scale.
What carries the argument
The repetition-aware mixture scaling law modifies standard scaling relations by adding a term for the diminishing value of each repeated target token and an additive regularization benefit from generic data.
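As a rough illustration, the functional form quoted in the Lean-theorem section of this page, D_eff = (1 - h) D_total + tau D_T with D_T = D_target (1 + rho(r)) and rho(r) = r1 (1 - exp(-(r - 1) / r1)), can be evaluated in a few lines of Python. This is a minimal sketch, not the authors' code: the function names are hypothetical, reading h as the target fraction of the mixture and r as the number of passes over the target corpus is an assumption, and the values of tau and r1 are placeholders rather than fitted coefficients.

```python
import math


def rho(r: float, r1: float) -> float:
    """Saturating credit for repeated passes: r1 * (1 - exp(-(r - 1) / r1)).

    Equals 0 at r = 1 (a single pass) and approaches r1 as r grows.
    """
    if r < 1.0 or r1 <= 0.0:
        raise ValueError("this sketch assumes r >= 1 and r1 > 0")
    return r1 * (1.0 - math.exp(-(r - 1.0) / r1))


def effective_tokens(d_total: float, d_target: float, h: float,
                     r: float, tau: float, r1: float) -> float:
    """D_eff = (1 - h) * D_total + tau * D_target * (1 + rho(r))."""
    d_t = d_target * (1.0 + rho(r, r1))
    return (1.0 - h) * d_total + tau * d_t


# Placeholder values (NOT from the paper): tau = 1.0, r1 = 8.0.
one_pass = effective_tokens(d_total=100e9, d_target=1e9, h=0.01,
                            r=1.0, tau=1.0, r1=8.0)
many_passes = rho(1000.0, 8.0)  # close to the ceiling r1 = 8.0
```

Under this form each extra pass adds less than the one before, and D_T can never exceed (1 + r1) times the number of unique target tokens.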
If this is right
- Mixture ratios can be chosen by solving the scaling law rather than by running many separate training experiments.
- Target data repetition can be set higher when generic data is present, reducing the total volume of unique target tokens needed.
- Optimal repetition levels rise with model scale and compute budget but fall as the absolute size of the target corpus increases.
- The same law applies across multilingual, domain-specific, and quality-filtered target sources.
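The link between a mixture ratio and a repetition count is simple arithmetic, sketched below in Python. It assumes, as a simplification not stated on this page, that a fraction h of all training tokens is drawn from the target corpus and that the corpus is cycled uniformly, so the number of passes is r = h * D_total / D_target. The function names and token counts are hypothetical.

```python
def repetitions(h: float, d_total: float, d_target: float) -> float:
    """Passes over the target corpus when a fraction h of d_total is target data."""
    return h * d_total / d_target


def target_fraction(r: float, d_total: float, d_target: float) -> float:
    """Mixture fraction h needed to repeat the target corpus r times."""
    h = r * d_target / d_total
    if not 0.0 <= h <= 1.0:
        raise ValueError("requested repetitions do not fit in the token budget")
    return h


# Hypothetical budget: 1B unique target tokens, 100B training tokens in total.
h_low = target_fraction(15, d_total=100e9, d_target=1e9)   # 15 passes
h_high = target_fraction(20, d_total=100e9, d_target=1e9)  # 20 passes
```

With these made-up numbers, the 15-20x range corresponds to giving the target corpus roughly 15-20 percent of the training tokens; a larger token budget or a smaller corpus lowers that fraction for the same repetition count.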
Where Pith is reading between the lines
- The framework could be adapted to decide how much synthetic target data to generate when real data is exhausted.
- Similar repetition-aware laws might apply to continued pretraining or instruction-tuning stages where domain data is also limited.
- Hardware-aware versions of the law could incorporate per-token latency differences between target and generic batches.
Load-bearing premise
The repetition tolerance and scaling behavior observed across the tested model sizes and data types will continue to hold when the same approach is applied at larger scales or with different data distributions.
What would settle it
A controlled experiment at substantially larger model scale or with a new data type in which the measured optimal repetition count for a given target size and compute budget deviates by more than 30 percent from the value predicted by the scaling law.
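The 30 percent criterion above can be written as a one-line check. The Python sketch below is only an illustration: the function name is hypothetical, and it assumes the deviation is measured relative to the predicted repetition count, which is one reasonable reading of the sentence.

```python
def deviates_beyond_tolerance(measured_r: float, predicted_r: float,
                              tolerance: float = 0.30) -> bool:
    """True if the measured optimum differs from the prediction by more than `tolerance`.

    The deviation is taken relative to the predicted value (an assumption).
    """
    if predicted_r <= 0:
        raise ValueError("predicted repetition count must be positive")
    return abs(measured_r - predicted_r) / predicted_r > tolerance
```

For instance, a predicted optimum of 18 passes would be contradicted by a measured optimum of 24 (a 33 percent gap) but not by a measured optimum of 20 (an 11 percent gap).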
Original abstract
As language models scale, the amount of data they require grows -- yet many target data sources, such as low-resource languages or specialized domains, are inherently limited in size. A common strategy is to mix this scarce but valuable target data with abundant generic data, which presents a fundamental trade-off: too little target data in the mixture underexposes the model to the target domain, while too much target data repeats the same examples excessively, yielding diminishing returns and eventual overfitting. We study this trade-off across more than 2,000 language-model training runs spanning multiple model and target dataset sizes, as well as several data types, including multilingual, domain-specific, and quality-filtered mixtures. Across all settings, we find that repetition is a central driver of target-domain performance, and that mixture training tolerates much higher repetition than single-source training: scarce target corpora can be reused 15-20 times, with the optimal number of repetitions depending on the target data size, compute budget, and model scale. Next, we introduce a repetition-aware mixture scaling law that accounts for the decreasing value of repeated target tokens and the regularizing role of generic data. Optimizing the scaling law provides a principled way to compute effective mixture configurations, yielding practical mixture recommendations for pretraining under data constraints.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper examines the trade-off in pretraining language models when mixing scarce target-domain data (e.g., low-resource languages or specialized domains) with abundant generic data. Across more than 2,000 training runs spanning model sizes, target dataset sizes, and data types (multilingual, domain-specific, quality-filtered), it reports that repetition of target tokens is a central driver of performance and that mixture training tolerates substantially higher repetition (optimal 15-20x) than single-source training, with the optimum depending on target size, compute budget, and model scale. It then introduces a repetition-aware mixture scaling law that models the decreasing value of repeated target tokens together with the regularizing effect of generic data; optimizing this law is claimed to yield principled mixture recommendations under data constraints.
Significance. If the scaling law is shown to be predictive rather than a post-hoc fit, the work would offer both empirical guidance and a practical tool for pretraining under realistic data scarcity, a common constraint in multilingual and domain-specific settings. The scale of the experimental campaign (>2000 runs) provides a solid empirical foundation for the repetition-tolerance claims and the dependence on target size, compute, and model scale.
major comments (2)
- [Scaling law section (post-experiments)] The repetition-aware mixture scaling law is introduced after the experimental results are presented. It is unclear whether its functional form and coefficients were derived independently (e.g., from first principles or a separate theoretical argument) or selected/tuned to match the measured performance curves from the same >2000 runs. If the latter, the claim that the law provides a 'principled way to compute effective mixture configurations' reduces to a descriptive fit whose extrapolation to new regimes or held-out repetition ratios remains untested.
- [Experimental results and figures] No error bars, confidence intervals, or exclusion criteria are reported for the performance measurements across the 2000+ runs. This makes it difficult to assess the statistical reliability of the reported optimal repetition counts (15-20x) and the dependence on target size, compute, and scale.
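One common way to produce the error bars requested here is to repeat a configuration with several random seeds and report the mean with its standard error. The Python sketch below shows that computation using only the standard library; the function name is hypothetical and the loss values are invented for illustration, not taken from the paper.

```python
import math
import statistics


def summarize_seeds(losses):
    """Return (mean, sample standard deviation, standard error) over seeds."""
    if len(losses) < 2:
        raise ValueError("need at least two seeds to estimate variability")
    mean = statistics.mean(losses)
    std = statistics.stdev(losses)        # sample std (n - 1 denominator)
    sem = std / math.sqrt(len(losses))    # standard error of the mean
    return mean, std, sem


# Invented target-domain losses for one configuration run with three seeds.
mean, std, sem = summarize_seeds([2.0, 2.2, 2.4])
```

Two candidate repetition counts whose mean losses differ by less than a couple of standard errors cannot be reliably ranked, which is why this request bears directly on the reported 15-20x range.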
minor comments (2)
- [Abstract and introduction] The abstract and main text would benefit from an explicit statement of how the scaling-law parameters were obtained (e.g., least-squares fit on which subset of runs, or closed-form derivation).
- [Scaling law definition] Notation for the scaling law (e.g., symbols for repetition factor, effective tokens, regularization term) should be introduced with a clear table or equation reference to avoid ambiguity when the law is later optimized.
Simulated Author's Rebuttal
Thank you for the constructive feedback on our manuscript. We address each major comment below and describe the revisions we plan to make.
Point-by-point responses
- Referee: [Scaling law section (post-experiments)] The repetition-aware mixture scaling law is introduced after the experimental results are presented. It is unclear whether its functional form and coefficients were derived independently (e.g., from first principles or a separate theoretical argument) or selected/tuned to match the measured performance curves from the same >2000 runs. If the latter, the claim that the law provides a 'principled way to compute effective mixture configurations' reduces to a descriptive fit whose extrapolation to new regimes or held-out repetition ratios remains untested.
  Authors: The functional form is motivated by established scaling-law structures for diminishing returns under data repetition (building on prior work such as Hoffmann et al.) together with an explicit term for the regularizing contribution of generic data. Coefficients were obtained by fitting to the full set of runs. We acknowledge that this renders the law primarily empirical rather than derived from first principles. In the revision we will clarify this distinction in the text, add a dedicated subsection on model derivation, and report explicit held-out validation: we will reserve a subset of repetition ratios and target-size/compute combinations, refit on the remainder, and demonstrate that the law still predicts optimal mixtures with low error on the held-out points.
  Revision: yes
- Referee: [Experimental results and figures] No error bars, confidence intervals, or exclusion criteria are reported for the performance measurements across the 2000+ runs. This makes it difficult to assess the statistical reliability of the reported optimal repetition counts (15-20x) and the dependence on target size, compute, and scale.
  Authors: We agree that error bars and explicit exclusion criteria would strengthen statistical assessment. Because of the computational cost of more than 2,000 full training runs, we did not repeat every configuration with independent random seeds. Nevertheless, the reported optimal repetition range (15-20x) and its dependence on target size, compute, and scale emerge consistently across multiple data regimes (multilingual, domain-specific, quality-filtered). In the revised manuscript we will (i) state the exclusion criteria used for outlier runs, (ii) add error bars to the primary figures based on repeated-seed subsets for representative model and data sizes, and (iii) include a short variability analysis quantifying the standard deviation observed across those repeats.
  Revision: yes
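The held-out validation the authors describe could look roughly like the Python sketch below: reserve some repetition ratios, fit on the rest, and measure the prediction error on the reserved points. This is an assumption-laden illustration, not the paper's procedure. A plain power law fitted in log-log space stands in for the repetition-aware law, whose full form and coefficients are not given on this page, and the data points are synthetic.

```python
import math


def fit_power_law(xs, ys):
    """Least-squares fit of y = a * x**b in log-log space (stand-in model)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mx, my = sum(lx) / n, sum(ly) / n
    b = (sum((u - mx) * (v - my) for u, v in zip(lx, ly))
         / sum((u - mx) ** 2 for u in lx))
    a = math.exp(my - b * mx)
    return a, b


def holdout_relative_errors(runs, held_out_reps):
    """Fit on runs whose repetition count is not held out; score the rest.

    runs: list of (repetitions, loss) pairs.
    held_out_reps: set of repetition counts reserved for testing.
    """
    train = [(r, loss) for r, loss in runs if r not in held_out_reps]
    test = [(r, loss) for r, loss in runs if r in held_out_reps]
    a, b = fit_power_law([r for r, _ in train], [loss for _, loss in train])
    return {r: abs(a * r ** b - loss) / loss for r, loss in test}


# Synthetic points lying exactly on loss = 3 * r**-0.1 (NOT real measurements).
runs = [(r, 3.0 * r ** -0.1) for r in (1, 2, 4, 8, 16, 32)]
errors = holdout_relative_errors(runs, held_out_reps={8, 32})
```

Because the synthetic points lie exactly on a power law, the held-out error here is essentially zero; on real runs, the size of this error on unseen repetition ratios (including ones beyond the fitted range, such as 32 above) is what would separate a predictive law from a descriptive fit.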
Circularity Check
Repetition-aware scaling law fitted to experimental runs rather than independently derived from first principles
specific steps
- fitted input called prediction [Abstract]
"Next, we introduce a repetition-aware mixture scaling law that accounts for the decreasing value of repeated target tokens and the regularizing role of generic data. Optimizing the scaling law provides a principled way to compute effective mixture configurations, yielding practical mixture recommendations for pretraining under data constraints."
The law is introduced after describing the >2000 runs across model sizes and data types. Its parameters are selected or tuned so that the law reproduces the measured target-domain performance as a function of repetition and mixture ratio; the subsequent 'optimization' therefore outputs recommendations that are statistically forced by the fit rather than predicted from an external derivation.
full rationale
The paper conducts >2000 runs, then introduces a scaling law whose functional form accounts for repetition value and generic-data regularization. Optimizing this law is presented as yielding principled mixture recommendations. Because the law is introduced after the runs and its parameters are chosen to match the observed performance curves (as implied by the experimental scale and the claim of 'accounting for' the patterns), the recommendations reduce to a descriptive fit on the same data rather than an independent prediction. This is a moderate instance of fitted-input-called-prediction; the central claim still contains empirical content but the 'principled' status is not independently validated.
Axiom & Free-Parameter Ledger
free parameters (1)
- optimal repetition count
axioms (1)
- domain assumption: Generic data provides regularization that mitigates overfitting from repeated target data.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
  unclear: Relation between the paper passage and the cited Recognition theorem.
  We introduce a repetition-aware mixture scaling law that accounts for the decreasing value of repeated target tokens... $D_{\mathrm{eff}} = (1 - h)\,D_{\mathrm{total}} + \tau D_T$, $D_T = D_{\mathrm{target}}\,(1 + \rho(r))$ with $\rho(r) = r_1\left(1 - e^{-(r-1)/r_1}\right)$
- IndisputableMonolith/Foundation/DimensionForcing.lean · reality_from_one_distinction · unclear
  unclear: Relation between the paper passage and the cited Recognition theorem.
  optimal repetition... reaching up to 15–20... mixture training tolerates much higher repetition than single-source training
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
Parameters vs flops: Scaling laws for optimal sparsity for mixture-of-experts language models
Samira Abnar, Harshay Shah, Dan Busbridge, Alaaeldin El-Nouby, Joshua M Susskind, and Vimal Thilak. Parameters vs flops: Scaling laws for optimal sparsity for mixture-of-experts language models. In International Conference on Machine Learning, pages 204--230. PMLR, 2025
-
[2]
Anonymous. Mix, don't tune: Bilingual pre-training outperforms hyperparameter search in data-constrained settings. Submitted to NeurIPS 2026, 2026
-
[3]
Anthropic. Introducing Claude Opus 4.6. Anthropic Announcements, 2026. URL https://www.anthropic.com/news/claude-opus-4-6
-
[4]
SmolLM3: smol, multilingual, long-context reasoner
Elie Bakouch, Loubna Ben Allal, Anton Lozhkov, Nouamane Tazi, Lewis Tunstall, Carlos Miguel Patiño, Edward Beeching, Aymeric Roucher, Aksel Joonas Reedi, Quentin Gallouédec, Kashif Rasul, Nathan Habib, Clémentine Fourrier, Hynek Kydlicek, Guilherme Penedo, Hugo Larcher, Mathieu Morlon, Vaibhav Srivastav, Joshua Lochner, Xuan-Son Nguyen, Colin Raffel, Lean...
-
[5]
Scaling laws for forgetting during finetuning with pretraining data injection
Louis Béthune, David Grangier, Dan Busbridge, Eleonora Gualdoni, Marco Cuturi, and Pierre Ablin. Scaling laws for forgetting during finetuning with pretraining data injection. In Forty-second International Conference on Machine Learning, 2025. URL https://openreview.net/forum?id=vWMij23BmQ
-
[6]
Piqa: Reasoning about physical commonsense in natural language
Yonatan Bisk, Rowan Zellers, Ronan Le Bras, Jianfeng Gao, and Yejin Choi. Piqa: Reasoning about physical commonsense in natural language. In Thirty-Fourth AAAI Conference on Artificial Intelligence, 2020
-
[7]
Scaling parameter-constrained language models with quality data
Ernie Chang, Matteo Paltenghi, Yang Li, Pin-Jie Lin, Changsheng Zhao, Patrick Huber, Zechun Liu, Rastislav Rabatin, Yangyang Shi, and Vikas Chandra. Scaling parameter-constrained language models with quality data. In Franck Dernoncourt, Daniel Preoţiuc-Pietro, and Anastasia Shimorina, editors, Proceedings of the 2024 Conference on Empirical Methods in N...
-
[8]
Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge
Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? try arc, the ai2 reasoning challenge. arXiv preprint arXiv:1803.05457, 2018
-
[9]
Shizhe Diao, Yu Yang, Yonggan Fu, Xin Dong, Dan Su, Markus Kliegl, Zijia Chen, Peter Belcak, Yoshi Suhara, Hongxu Yin, Mostofa Patwary, Yingyan Lin, Jan Kautz, and Pavlo Molchanov. Nemotron-climb: Clustering-based iterative data mixture bootstrapping for language model pre-training, 2025. URL https://arxiv.org/abs/2504.13161
-
[10]
Essential-web v1.0: 24t tokens of organized web data, 2025
Essential AI: Andrew Hojel, Michael Pust, Tim Romanski, Yash Vanjani, Ritvik Kapila, Mohit Parmar, Adarsh Chaluvaraju, Alok Tripathy, Anil Thomas, Ashish Tanwer, Darsh J Shah, Ishaan Shah, Karl Stratos, Khoi Nguyen, Kurt Smith, Michael Callahan, Peter Rushton, Philip Monk, Platon Mazarakis, Saad Jamal, Saurabh Srivastava, Somanshu Singla, and Ashish V...
-
[11]
Doge: domain reweighting with generalization estimation
Simin Fan, Matteo Pagliardini, and Martin Jaggi. Doge: domain reweighting with generalization estimation. In Proceedings of the 41st International Conference on Machine Learning, ICML'24. JMLR.org, 2024
-
[12]
The Pile: An 800GB Dataset of Diverse Text for Language Modeling
Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, Shawn Presser, and Connor Leahy. The Pile: An 800gb dataset of diverse text for language modeling. arXiv preprint arXiv:2101.00027, 2020
-
[13]
Scaling laws for data filtering -- data curation cannot be compute agnostic
Sachin Goyal, Pratyush Maini, Zachary C. Lipton, Aditi Raghunathan, and J. Zico Kolter. Scaling laws for data filtering—data curation cannot be compute agnostic. In 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 22702--22711, 2024. doi:10.1109/CVPR52733.2024.02142
-
[14]
Task-adaptive pretrained language models via clustered-importance sampling
David Grangier, Simin Fan, Skyler Seto, and Pierre Ablin. Task-adaptive pretrained language models via clustered-importance sampling. In ICLR, 2025
-
[15]
Textbooks are all you need, June 2023
Suriya Gunasekar, Yi Zhang, Jyoti Aneja, Caio Cesar, Teodoro Mendes, Allie Del Giorno, Sivakanth Gopi, Mojan Javaheripi, Piero Kauffmann, Gustavo de Rosa, Olli Saarikivi, Adil Salim, Shital Shah, Harkirat Singh Behl, Xin Wang, Sébastien Bubeck, Ronen Eldan, Adam Tauman Kalai, Yin Tat Lee, and Yuanzhi Li. Textbooks are all you need, June 2023. URL https://...
-
[16]
Scaling laws and compute-optimal training beyond fixed training durations
Alexander Hägele, Elie Bakouch, Atli Kosson, Loubna Ben Allal, Leandro Von Werra, and Martin Jaggi. Scaling laws and compute-optimal training beyond fixed training durations. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. URL https://openreview.net/forum?id=Y13gSfTjGr
-
[17]
Scaling Laws and Interpretability of Learning from Repeated Data
Danny Hernandez, Tom Brown, Tom Conerly, Nova DasSarma, Dawn Drain, Sheer El-Showk, Nelson Elhage, Zac Hatfield-Dodds, Tom Henighan, Tristan Hume, Scott Johnston, Ben Mann, Chris Olah, Catherine Olsson, Dario Amodei, Nicholas Joseph, Jared Kaplan, and Sam McCandlish. Scaling laws and interpretability of learning from repeated data, 2022. URL https://arxiv...
-
[18]
Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, Tom Hennigan, Eric Noland, Katie Millican, George van den Driessche, Bogdan Damoc, Aurelia Guy, Simon Osindero, Karen Simonyan, Erich Elsen, Oriol Vinyals, Jack W. Rae, and Laurent Sifre...
-
[19]
The State and Fate of Linguistic Diversity and Inclusion in the NLP World
Pratik Joshi, Sebastin Santy, Amar Budhiraja, Kalika Bali, and Monojit Choudhury. The state and fate of linguistic diversity and inclusion in the NLP world. In Dan Jurafsky, Joyce Chai, Natalie Schluter, and Joel Tetreault, editors, Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 6282--6293, Online, July 2020...
-
[20]
Scaling Laws for Neural Language Models
Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020
-
[21]
Adam: A Method for Stochastic Optimization
Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014
-
[22]
Scaling laws for fine-grained mixture of experts
Jakub Krajewski, Jan Ludziejewski, Kamil Adamczewski, Maciej Pióro, Michał Krutul, Szymon Antoniak, Kamil Ciebiera, Krystian Król, Tomasz Odrzygóźdź, Piotr Sankowski, Marek Cygan, and Sebastian Jaszczur. Scaling laws for fine-grained mixture of experts, 2024. URL https://arxiv.org/abs/2402.07871
-
[23]
Taku Kudo and John Richardson. Sentencepiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 66--71, 2018
-
[24]
DataComp-LM: In search of the next generation of training sets for language models
Jeffrey Li, Alex Fang, Georgios Smyrnis, Maor Ivgi, Matt Jordan, Samir Gadre, Hritik Bansal, Etash Guha, Sedrick Keh, Kushal Arora, Saurabh Garg, Rui Xin, Niklas Muennighoff, Reinhard Heckel, Jean Mercat, Mayee Chen, Suchin Gururangan, Mitchell Wortsman, Alon Albalak, Yonatan Bitton, Marianna Nezhurina, Amro Abbas, Cheng-Yu Hsieh, Dhruba Ghosh, Josh Gardn...
-
[25]
From acceleration to saturation: Scaling behavior of bootstrapped language model pretraining
Seng Pei Liew and Takuya Kato. From acceleration to saturation: Scaling behavior of bootstrapped language model pretraining. In NeurIPS 2025 Workshop on Evaluating the Evolving LLM Lifecycle: Benchmarks, Emergent Abilities, and Scaling, 2025. URL https://openreview.net/forum?id=PhsneSYvWK
-
[26]
Thang Luong and Edward Lockhart. Advanced version of gemini with deep think officially achieves gold-medal standard at the international mathematical olympiad. Google DeepMind Blog, 1, 2025
-
[27]
Rephrasing the web: A recipe for compute and data-efficient language modeling
Pratyush Maini, Skyler Seto, Richard Bai, David Grangier, Yizhe Zhang, and Navdeep Jaitly. Rephrasing the web: A recipe for compute and data-efficient language modeling. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 14044--14072, 2024
-
[28]
Scaling data-constrained language models
Niklas Muennighoff, Alexander M Rush, Boaz Barak, Teven Le Scao, Nouamane Tazi, Aleksandra Piktus, Sampo Pyysalo, Thomas Wolf, and Colin Raffel. Scaling data-constrained language models. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. URL https://openreview.net/forum?id=j5BuTrEj35
-
[29]
Team OLMo: Allyson Ettinger, Amanda Bertsch, Bailey Kuehl, David Graham, David Heineman, Dirk Groeneveld, Faeze Brahman, Finbarr Timbers, Hamish Ivison, Jacob Morrison, Jake Poznanski, Kyle Lo, Luca Soldaini, Matt Jordan, Mayee Chen, Michael Noukhovitch, Nathan Lambert, Pete Walsh, Pradeep Dasigi, Robert Berry, Saumya Malik, Saurabh Shah, Scott Geng, S...
-
[30]
Team OLMo, Pete Walsh, Luca Soldaini, Dirk Groeneveld, Kyle Lo, Shane Arora, Akshita Bhagia, Yuling Gu, Shengyi Huang, Matt Jordan, Nathan Lambert, Dustin Schwenk, Oyvind Tafjord, Taira Anderson, David Atkinson, Faeze Brahman, Christopher Clark, Pradeep Dasigi, Nouha Dziri, Allyson Ettinger, Michal Guerquin, David Heineman, Hamish Ivison, Pang Wei Koh, Ji...
-
[31]
The LAMBADA dataset: Word prediction requiring a broad discourse context
Denis Paperno, Germán Kruszewski, Angeliki Lazaridou, Ngoc Quan Pham, Raffaella Bernardi, Sandro Pezzelle, Marco Baroni, Gemma Boleda, and Raquel Fernández. The LAMBADA dataset: Word prediction requiring a broad discourse context. In Katrin Erk and Noah A. Smith, editors, Proceedings of the 54th Annual Meeting of the Association for Computational ...
- [32]
-
[33]
The FineWeb datasets: Decanting the web for the finest text data at scale
Guilherme Penedo, Hynek Kydlíček, Loubna Ben Allal, Anton Lozhkov, Margaret Mitchell, Colin Raffel, Leandro Von Werra, and Thomas Wolf. The FineWeb datasets: Decanting the web for the finest text data at scale. In Advances in Neural Information Processing Systems, volume 37, 2024
-
[34]
Guilherme Penedo, Hynek Kydlíček, Vinko Sabolčec, Bettina Messmer, Negar Foroutan, Amir Hossein Kargaran, Colin Raffel, Martin Jaggi, Leandro Von Werra, and Thomas Wolf. FineWeb2: One pipeline to scale them all -- adapting pre-training data processing to every language. arXiv preprint arXiv:2506.20920, 2025
-
[35]
Resolving discrepancies in compute-optimal scaling of language models
Tomer Porian, Mitchell Wortsman, Jenia Jitsev, Ludwig Schmidt, and Yair Carmon. Resolving discrepancies in compute-optimal scaling of language models. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. URL https://openreview.net/forum?id=4fSSqpk1sM
-
[36]
D-CPT law: Domain-specific continual pre-training scaling law for large language models
Haoran Que, Jiaheng Liu, Ge Zhang, Chenchen Zhang, Xingwei Qu, Yinghao Ma, Feiyu Duan, Zhiqi Bai, Jiakai Wang, Yuanxing Zhang, Xu Tan, Jie Fu, Jiamang Wang, Lin Qu, Wenbo Su, and Bo Zheng. D-CPT law: Domain-specific continual pre-training scaling law for large language models. In The Thirty-eighth Annual Conference on Neural Information Processing Systems,...
-
[37]
Language models are unsupervised multitask learners
Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. OpenAI blog, 2019
-
[38]
Exploring the limits of transfer learning with a unified text-to-text transformer
Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1--67, 2020
-
[39]
Winogrande: An adversarial winograd schema challenge at scale
Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. Winogrande: An adversarial winograd schema challenge at scale. Proceedings of AAAI, 2020
-
[40]
Training bilingual lms with data constraints in the targeted language
Skyler Seto, Maartje Ter Hoeve, Richard He Bai, Natalie Schluter, and David Grangier. Training bilingual lms with data constraints in the targeted language. In Findings of the Association for Computational Linguistics: ACL 2025, pages 19096--19122, 2025
-
[41]
Optimal splitting of language models from mixtures to specialized domains
Skyler Seto, Pierre Ablin, Anastasiia Filippova, Jiayuan Ye, Louis Bethune, Angelos Katharopoulos, and David Grangier. Optimal splitting of language models from mixtures to specialized domains. arXiv preprint arXiv:2603.19149, 2026
-
[42]
Scaling laws for optimal data mixtures
Mustafa Shukor, Louis Bethune, Dan Busbridge, David Grangier, Enrico Fini, Alaaeldin El-Nouby, and Pierre Ablin. Scaling laws for optimal data mixtures. In NeurIPS, 2025. URL https://arxiv.org/abs/2507.09404
-
[43]
Aaditya Singh, Adam Fry, Adam Perelman, Adam Tart, Adi Ganesh, Ahmed El-Kishky, Aidan McLaughlin, Aiden Low, AJ Ostrow, Akhila Ananthram, et al. Openai gpt-5 system card. arXiv preprint arXiv:2601.03267, 2026
-
[44]
SlimPajama: A 627B token cleaned and deduplicated version of RedPajama
Daria Soboleva, Faisal Al-Khateeb, Robert Myers, Jacob R Steeves, Joel Hestness, and Nolan Dey. SlimPajama: A 627B token cleaned and deduplicated version of RedPajama. https://cerebras.ai/blog/slimpajama-a-627b-token-cleaned-and-deduplicated-version-of-redpajama, 2023. URL https://huggingface.co/datasets/cerebras/SlimPajama-627B
-
[45]
peS2o (pretraining efficiently on S2ORC) dataset
Luca Soldaini and Kyle Lo. peS2o (pretraining efficiently on S2ORC) dataset. Technical report, Allen Institute for AI, 2023
-
[46]
doi: 10.18653/v1/2024.acl-long.840
Luca Soldaini, Rodney Kinney, Akshita Bhagia, Dustin Schwenk, David Atkinson, Russell Authur, Ben Bogin, Khyathi Chandu, Jennifer Dumas, Yanai Elazar, Valentin Hofmann, Ananya Jha, Sachin Kumar, Li Lucy, Xinxi Lyu, Nathan Lambert, Ian Magnusson, Jacob Morrison, Niklas Muennighoff, Aakanksha Naik, Crystal Nam, Matthew Peters, Abhilasha Ravichander, Kyle Ri...
-
[47]
Nemotron-CC: Transforming Common Crawl into a refined long-horizon pretraining dataset
Dan Su, Kezhi Kong, Ying Lin, Joseph Jennings, Brandon Norick, Markus Kliegl, Mostofa Patwary, Mohammad Shoeybi, and Bryan Catanzaro. Nemotron-CC: Transforming Common Crawl into a refined long-horizon pretraining dataset. In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar, editors, Proceedings of the 63rd Annual Meeting of ...
-
[48]
Siqi Wang, Zhengyu Chen, Bei Li, Keqing He, Min Zhang, and Jingang Wang. Scaling laws across model architectures: A comparative analysis of dense and MoE models in large language models. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen, editors, Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 5583--5595,...
-
[49]
Emergent abilities of large language models
Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, Ed H. Chi, Tatsunori Hashimoto, Oriol Vinyals, Percy Liang, Jeff Dean, and William Fedus. Emergent abilities of large language models. Transactions on Machine Learning Research, 2022. ISSN 2835-8856. URL https://open...
-
[50]
Johannes Welbl, Nelson F. Liu, and Matt Gardner. Crowdsourcing multiple choice science questions. In Leon Derczynski, Wei Xu, Alan Ritter, and Tim Baldwin, editors, Proceedings of the 3rd Workshop on Noisy User-generated Text, pages 94--106, Copenhagen, Denmark, September 2017. Association for Computational Linguistics. doi:10.18653/v1/W17-4413. URL https...
-
[51]
David P. Woodruff, Vincent Cohen-Addad, Lalit Jain, Jieming Mao, Song Zuo, MohammadHossein Bateni, Simina Branzei, Michael P. Brenner, Lin Chen, Ying Feng, Lance Fortnow, Gang Fu, Ziyi Guan, Zahra Hadizadeh, Mohammad T. Hajiaghayi, Mahdi JafariRaviz, Adel Javanmard, Karthik C. S., Ken ichi Kawarabayashi, Ravi Kumar, Silvio Lattanzi, Euiwoong Lee, Yi Li, I...
-
[52]
DoReMi: Optimizing data mixtures speeds up language model pretraining
Sang Michael Xie, Hieu Pham, Xuanyi Dong, Nan Du, Hanxiao Liu, Yifeng Lu, Percy Liang, Quoc V Le, Tengyu Ma, and Adams Wei Yu. DoReMi: Optimizing data mixtures speeds up language model pretraining. Advances in Neural Information Processing Systems, 36, 2023
-
[53]
Data mixing laws: Optimizing data mixtures by predicting language modeling performance
Jiasheng Ye, Peiju Liu, Tianxiang Sun, Jun Zhan, Yunhua Zhou, and Xipeng Qiu. Data mixing laws: Optimizing data mixtures by predicting language modeling performance. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=jjCB27TMK3
-
[54]
Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. HellaSwag: Can a machine really finish your sentence? In Anna Korhonen, David Traum, and Lluís Màrquez, editors, Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4791--4800, Florence, Italy, July 2019. Association for Computatio...
-
[55]
When scaling meets LLM finetuning: The effect of data, model and finetuning method
Biao Zhang, Zhongtao Liu, Colin Cherry, and Orhan Firat. When scaling meets LLM finetuning: The effect of data, model and finetuning method. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=5HCnKDeTws