Prescriptive Scaling Laws for Data-Constrained Training
Pith reviewed 2026-05-09 14:08 UTC · model grok-4.3
The pith
Excess loss from repeating data is captured by a simple additive penalty, shifting optimal compute allocation from further repetition toward model capacity.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We introduce a scaling law for data-constrained training by adding a simple overfitting penalty to the standard Chinchilla form that accounts for the excess loss from token repetition. This penalty term accurately fits observed model behavior across sizes and configurations. The resulting law predicts that after a certain point, increasing the number of repetitions raises loss more than it saves compute, so optimal allocation shifts to training larger models instead. We validate that configurations chosen according to this law outperform standard practice in data-limited settings, and that high weight decay lowers the penalty coefficient substantially.
What carries the argument
An additive overfitting penalty with a single coefficient that isolates the effect of data repetition on excess loss.
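The shape of this argument can be sketched in code. Everything numeric below is illustrative: the constants are Chinchilla-style placeholders in the spirit of Hoffmann et al., and the `k * log(R)` penalty shape is an assumption standing in for the paper's fitted expression, which is not reproduced on this page.

```python
import math

# Illustrative Chinchilla-style constants; not the paper's fitted values.
E, A, B = 1.69, 406.4, 410.7
ALPHA, BETA = 0.34, 0.28

def chinchilla_loss(N, D):
    """Standard Chinchilla form: every one of the D training tokens is unique."""
    return E + A / N**ALPHA + B / D**BETA

def data_constrained_loss(N, U, R, k=0.02):
    """Adds a one-parameter overfitting penalty for training an N-parameter
    model on U unique tokens repeated R times. The k * log(R) form is an
    illustrative assumption, not the paper's fitted expression."""
    return chinchilla_loss(N, U * R) + k * math.log(R)
```

Setting `R = 1` recovers the standard Chinchilla form, which is the boundary condition any such additive penalty must satisfy.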
If this is right
- In data-constrained regimes, compute is better allocated to increasing model parameters rather than additional epochs on repeated data once a threshold is reached.
- Following the recommended allocation from the scaling law yields lower final loss than repeating data maximally.
- The overfitting penalty coefficient can be used to quantify and compare the effectiveness of regularization techniques such as weight decay.
- Strong weight decay reduces the penalty by about 70%, explaining why larger decay values are optimal when data is limited.
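The allocation shift described in these bullets can be sketched as a grid search over model size under a fixed compute budget. All constants and the `k * log(R)` penalty shape are illustrative assumptions; the point is only the qualitative effect of the penalty coefficient on the optimum.

```python
import math

# Illustrative Chinchilla-style constants; not the paper's fitted values.
E, A, B, ALPHA, BETA = 1.69, 406.4, 410.7, 0.34, 0.28

def loss(N, U, R, k):
    """Loss for an N-parameter model trained on U unique tokens repeated R
    times, with an assumed additive overfitting penalty k * log(R)."""
    D = U * R
    return E + A / N**ALPHA + B / D**BETA + k * math.log(R)

def optimal_model_size(C, U, k):
    """Grid-search N under a fixed compute budget C ~ 6*N*D and a fixed pool
    of U unique tokens; compute left over after one epoch buys repetitions."""
    best_N, best_L = None, float("inf")
    for tenth in range(60, 120):      # N from 1e6 to ~8e11, log-spaced
        N = 10 ** (tenth / 10)
        R = C / (6 * N * U)           # implied repetition factor
        if R < 1:
            continue                  # model too large to finish one epoch
        L = loss(N, U, R, k)
        if L < best_L:
            best_N, best_L = N, L
    return best_N
```

With these made-up numbers, a larger penalty coefficient pushes the optimum away from many-epoch training of a Chinchilla-proportioned model and toward a larger model trained on fewer repetitions, which is the qualitative recommendation above.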
Where Pith is reading between the lines
- This framework could be extended to decide between repeating data and generating synthetic data when real data is exhausted.
- It implies that monitoring the overfitting coefficient during training could guide dynamic adjustments to model size or regularization.
- The law may apply to other domains like vision or reinforcement learning where data repetition is common.
Load-bearing premise
The excess loss from data repetition follows a simple additive form whose coefficient does not vary with model size or specific training setup.
What would settle it
Observe whether the measured excess loss on repeated datasets matches the predicted additive penalty for models of varying sizes without requiring adjustments to the coefficient for each size.
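That test can be sketched as fitting the coefficient independently at each model size and checking agreement. The closed-form fit below assumes an `excess ≈ k * log(R)` shape, which is an illustrative stand-in for the paper's penalty form.

```python
import math

def fit_penalty_coeff(repetitions, excess_losses):
    """Closed-form least-squares fit of k in excess ≈ k * log(R).
    The log(R) shape is an illustrative assumption, not the paper's form."""
    num = sum(e * math.log(r) for r, e in zip(repetitions, excess_losses))
    den = sum(math.log(r) ** 2 for r in repetitions)
    return num / den

def coefficients_agree(fits, tol=0.3):
    """True if coefficients fitted independently per model size stay within
    +/- tol of their mean, i.e. no per-size adjustment is needed."""
    mean = sum(fits) / len(fits)
    return all(abs(f - mean) <= tol * mean for f in fits)
```

If `coefficients_agree` holds across the sizes tested, the load-bearing premise survives; a systematic drift of the fitted coefficient with model size would falsify it.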
original abstract
Training compute is increasingly outpacing the availability of high-quality data. This shifts the central challenge from optimal compute allocation to extracting maximum value from limited data. The widely adopted Chinchilla scaling law assumes every training token is unique. This limits its ability to guide pretraining decisions in data-constrained regimes. We model the excess loss under repetition with a simple additive overfitting penalty and find that it accurately describes model behavior. Our scaling law yields qualitatively new compute-optimal allocation advice. Beyond a point, further repetition is counterproductive and compute is better spent on model capacity. We show that following our law's recommended configuration improves performance in data-constrained regimes. Finally, because our one-parameter form isolates overfitting in a single coefficient, it enables direct comparison across training configurations. As a case study, we show that strong weight decay ($\lambda=1.0$) reduces this coefficient by approximately 70%, providing a scaling-law explanation for recent findings that optimal weight decay in data-constrained regimes is an order of magnitude larger than standard practice.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper extends the Chinchilla scaling law with a simple additive one-parameter overfitting penalty to model excess loss from data repetition in data-constrained regimes. It claims this form accurately describes observed behavior, yields qualitatively new compute-optimal advice (beyond a repetition threshold, allocate compute to model capacity rather than further repetition), demonstrates empirical performance gains from following the derived allocations, and shows that strong weight decay reduces the overfitting coefficient by ~70%, providing a scaling-law account for higher optimal regularization in such settings. The single coefficient is presented as enabling direct cross-configuration comparisons.
Significance. If the one-parameter overfitting term generalizes across scales, optimizers, and datasets, the work would offer a practical, interpretable tool for data-limited pretraining that shifts allocation strategy away from pure Chinchilla optima and supplies a quantitative explanation for recent empirical findings on weight decay. The simplicity of the form is a strength for enabling comparisons, though the prescriptive claims rest on the stability of the fitted coefficient.
major comments (2)
- [Abstract] The central prescriptive advice (stop repeating past a threshold and spend on capacity) and the empirical improvement claim depend on the overfitting coefficient generalizing sufficiently that the derived optimum does not shift materially. The abstract asserts that the form 'accurately describes model behavior' and enables cross-configuration comparison, but no explicit cross-validation, sensitivity analysis, or stability checks across model sizes/configurations/datasets are described; if the coefficient varies by more than ~20-30%, the repetition-vs-capacity crossover moves and the recommended configurations cease to be optimal.
- [Abstract] Validation of the scaling law appears to rely on fitting the single overfitting coefficient to the same data used to assess its descriptive accuracy, creating circularity that weakens the claim that the law 'accurately describes model behavior' independently of the fitting process. A held-out evaluation or out-of-distribution test of the coefficient's predictive power would be needed to support the prescriptive use.
minor comments (1)
- [Abstract] The abstract would be strengthened by briefly stating the range of model sizes, datasets, and repetition factors used to fit and validate the law.
Simulated Author's Rebuttal
We thank the referee for the insightful comments on our extension of Chinchilla scaling laws to data-constrained regimes via a one-parameter overfitting penalty. The feedback highlights valid points about the stability of the coefficient and validation rigor, which we address directly below. We maintain that the simple additive form provides practical prescriptive value, but agree that additional analyses will strengthen the claims.
point-by-point responses
Referee: [Abstract] The central prescriptive advice (stop repeating past a threshold and spend on capacity) and the empirical improvement claim depend on the overfitting coefficient generalizing sufficiently that the derived optimum does not shift materially. The abstract asserts that the form 'accurately describes model behavior' and enables cross-configuration comparison, but no explicit cross-validation, sensitivity analysis, or stability checks across model sizes/configurations/datasets are described; if the coefficient varies by more than ~20-30%, the repetition-vs-capacity crossover moves and the recommended configurations cease to be optimal.
Authors: We appreciate this emphasis on generalization. Our experiments already span model sizes from 100M to over 1B parameters and a range of repetition factors, with the fitted coefficient showing consistency within this regime. However, we agree that an explicit sensitivity analysis is warranted to support the prescriptive advice. In the revised manuscript, we will add a dedicated subsection that varies the overfitting coefficient by up to ±30% around the fitted value and recomputes the optimal allocation curves. This will demonstrate that the qualitative recommendation—shifting compute from repetition to capacity beyond a threshold—remains stable, thereby addressing the concern that small variations could invalidate the advice.
Revision: partial
Referee: [Abstract] Validation of the scaling law appears to rely on fitting the single overfitting coefficient to the same data used to assess its descriptive accuracy, creating circularity that weakens the claim that the law 'accurately describes model behavior' independently of the fitting process. A held-out evaluation or out-of-distribution test of the coefficient's predictive power would be needed to support the prescriptive use.
Authors: The referee is correct that our current validation fits the coefficient to the observed excess losses and then assesses fit quality on the same curves, which introduces a degree of circularity. To strengthen the claim of independent descriptive accuracy, we will revise the manuscript to include a held-out evaluation: the coefficient will be fitted exclusively on a subset of configurations (e.g., smaller models and lower repetition factors) and then used to predict excess loss on held-out larger models and unseen repetition schedules. We will report the prediction error on these held-out points to quantify the law's out-of-sample performance and support its use for prescriptive allocation.
Revision: yes
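The held-out protocol the authors promise can be sketched in a few lines: fit the coefficient on one subset of (repetition factor, excess loss) measurements, then report worst-case prediction error on held-out points. The `k * log(R)` shape is again an illustrative assumption.

```python
import math

def held_out_error(train_pts, test_pts):
    """Fit k on train_pts and report worst-case prediction error on test_pts.
    Points are (R, excess_loss) pairs; the k * log(R) penalty shape is an
    illustrative stand-in for the paper's fitted form."""
    num = sum(e * math.log(r) for r, e in train_pts)
    den = sum(math.log(r) ** 2 for r, _ in train_pts)
    k = num / den
    return max(abs(k * math.log(r) - e) for r, e in test_pts)
```

A small held-out error supports the non-circular reading of the law; a large one would indicate the coefficient does not transfer across configurations.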
Circularity Check
No significant circularity detected; derivation is empirical modeling with external validation
full rationale
The paper extends the Chinchilla scaling law by introducing an additive overfitting penalty term with a single fitted coefficient to capture repetition effects in data-constrained regimes. This is explicitly presented as a modeling choice ('we model the excess loss under repetition with a simple additive overfitting penalty and find that it accurately describes model behavior'), fitted to data, and then optimized to yield allocation advice. The advice is validated by showing performance improvements when following the recommended configurations, and the coefficient is used for cross-configuration comparisons (e.g., weight decay effects). No equations or steps reduce the central result to its inputs by construction, no self-citation chains justify load-bearing premises, and no fitted parameter is renamed as an independent prediction. The form is not claimed as first-principles but as an empirical fit whose generalization is tested via case studies. This is standard non-circular empirical scaling law construction.
Axiom & Free-Parameter Ledger
free parameters (1)
- overfitting penalty coefficient
axioms (1)
- domain assumption: Excess loss from token repetition can be modeled as a simple additive penalty that is independent of the usual compute and data scaling terms.
Reference graph
Works this paper leans on
- [1] Attention Is All You Need. Vaswani, Shazeer, Parmar, Uszkoreit, Jones, Gomez, Kaiser, Polosukhin. NeurIPS 2017.
- [2] Scaling Laws for Neural Language Models. arXiv:2001.08361.
- [3] Scaling Data-Constrained Language Models. Advances in Neural Information Processing Systems.
- [4] Training Compute-Optimal Large Language Models. Advances in Neural Information Processing Systems.
- [5] Pre-training under Infinite Compute. The Fourteenth International Conference on Learning Representations.
- [6] Scaling Laws and Interpretability of Learning from Repeated Data. arXiv:2205.10487.
- [7] The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale. Advances in Neural Information Processing Systems.
- [8] Signal and Noise: A Framework for Reducing Uncertainty in Language Model Evaluation. The Thirty-ninth Annual Conference on Neural Information Processing Systems.
- [9] Olmo 3. arXiv:2512.13961.
- [10] SmolLM2 (Allal, Lozhkov, Bakouch, et al.). Second Conference on Language Modeling.
- [11] Solving Quantitative Reasoning Problems with Language Models. Advances in Neural Information Processing Systems.
- [12] (Mis)Fitting Scaling Laws: A Survey of Scaling Law Fitting Techniques in Deep Learning. The Thirteenth International Conference on Learning Representations.
- [13] Llama 2: Open Foundation and Fine-Tuned Chat Models. arXiv:2307.09288.
- [14] OLMES: A Standard for Language Model Evaluations. Findings of the Association for Computational Linguistics: NAACL 2025.
- [15] Signal and Noise: A Framework for Reducing Uncertainty in Language Model Evaluation.
- [16] Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge. Clark, Cowhey, Etzioni, Khot, Sabharwal, Schoenick, Tafjord. arXiv:1803.05457.
- [17] Scaling Laws for Fine-Grained Mixture of Experts. Forty-first International Conference on Machine Learning.
- [18] Likelihood-Based Diffusion Language Models. Advances in Neural Information Processing Systems.
- [19] CommonsenseQA: A Question Answering Challenge Targeting Commonsense Knowledge. NAACL-HLT 2019.
- [20] Reproducible Scaling Laws for Contrastive Language-Image Learning. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
- [21] HellaSwag: Can a Machine Really Finish Your Sentence? Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics.
- [22] WinoGrande: An Adversarial Winograd Schema Challenge at Scale. Communications of the ACM, 2021.
- [23] PIQA: Reasoning about Physical Commonsense in Natural Language. Proceedings of the AAAI Conference on Artificial Intelligence.
- [24] Social IQa: Commonsense Reasoning about Social Interactions. EMNLP-IJCNLP 2019.
- [25] Crowdsourcing Multiple Choice Science Questions. Proceedings of the 3rd Workshop on Noisy User-generated Text.
- [26] A Dataset of Information-Seeking Questions and Answers Anchored in Research Papers. NAACL-HLT 2021.
- [27] LAB-Bench: Measuring Capabilities of Language Models for Biology Research. arXiv:2407.10362.
- [28] MedMCQA: A Large-Scale Multi-Subject Multi-Choice Dataset for Medical Domain Question Answering. Conference on Health, Inference, and Learning, 2022.
- [29] What Disease Does This Patient Have? A Large-Scale Open Domain Question Answering Dataset from Medical Exams. Applied Sciences, 2021.
- [30] SciRIFF: A Resource to Enhance Language Model Instruction-Following over Scientific Literature. Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing.
- [31] CoQA: A Conversational Question Answering Challenge. Transactions of the Association for Computational Linguistics, 2019.
- [32] DROP: A Reading Comprehension Benchmark Requiring Discrete Reasoning Over Paragraphs. NAACL-HLT 2019.
- [33] Natural Questions: A Benchmark for Question Answering Research. Kwiatkowski et al. Transactions of the Association for Computational Linguistics, 2019. https://aclanthology.org/Q19-1026/
- [34] SQuAD: 100,000+ Questions for Machine Comprehension of Text. EMNLP 2016.
- [35] The LAMBADA Dataset: Word Prediction Requiring a Broad Discourse Context. ACL 2016.
- [36] Scaling Laws for Precision. The Thirteenth International Conference on Learning Representations.
- [37] 200,000+ Jeopardy! Questions.
- [38] Scaling Laws for Generative Mixed-Modal Language Models. International Conference on Machine Learning, 2023.
- [39] Beyond Chinchilla-Optimal: Accounting for Inference in Language Model Scaling Laws. Forty-first International Conference on Machine Learning.
- [40] Language Models Scale Reliably with Over-Training and on Downstream Tasks. The Thirteenth International Conference on Learning Representations.
- [41] Deep Double Descent: Where Bigger Models and More Data Hurt. International Conference on Learning Representations.